The "TuringCompletePig" page has been changed by AlanGates.
http://wiki.apache.org/pig/TuringCompletePig

--------------------------------------------------

New page:
= Making Pig Latin Turing Complete =
== Introduction ==
As more users adopt Pig and begin writing their data processing in Pig Latin 
and as they use Pig to process more and more complex
tasks, a consistent request from these users is to add branches, loops, and 
functions to Pig Latin.  This will enable Pig Latin to
process a whole new class of problems.  Consider, for example, an algorithm 
that needs to iterate until an error estimate is less
than a given threshold.  This might look like (this just suggests logic, not 
syntax):

{{{
    error = 100.0;
    infile = 'original.data';
    while (error > 1.0) {
        A = load 'infile';
        B = group A all;
        C = foreach B generate flatten(doSomeCalculation(A)) as (result, error);
        error = foreach C generate error;
        store C into 'outfile';
        if (error > 1.0) mv 'outfile' 'infile';
    }
}}}

== Requirements ==
The following should be provided by this Turing complete Pig Latin:
 1. Branching.  This will be satisfied by standard `if` / `else if` / `else` 
functionality.
 1. Looping.  This should include standard `while` and some form of `for`.  
`for` could be C style or Python style (foreach).  Care needs to be taken to 
select syntax that does not cause confusion with the existing `foreach` 
operator in Pig Latin.
 1. Functions.  
 1. Modules.
 1. The ability to use local in-memory variables in the Pig Latin script.  For 
example, in the snippet given above, `infile` is defined before the `while` 
and then used in the `load`.
 1. The ability to "store" results into local in-memory variables.  For 
example, in the snippet given above, the error calculation from the data 
processing is stored into `error` by the line `error = foreach C generate 
error;`.

== Approach ==
There are two possible approaches to this.  One is to add all of these features 
to Pig Latin itself.  This has several advantages:
 * All Pig Latin operations will be first class objects in the language.  There 
will not be a need to do quoted programming, like what happens when JDBC is 
used to write SQL inside a Java program.
 * There will be opportunities to do optimizations that are not available in 
embedded programming, such as loop unrolling, etc.

However, the cost of this approach is very high.  It means turning Pig Latin 
into a full scripting language.  It also means that
tools such as debuggers will never be available for Pig Latin 
users, because the Pig team will not have the resources
or expertise to develop and maintain them.  And finally, does the world 
need another scripting language that starts with P?

The second possible approach to this is to embed Pig Latin into an existing 
scripting language, such as Perl, Python, Ruby, etc.  The
advantages of this are:
 * Most of the requirements noted above (branching, looping, functions, and 
modules) are present in these languages.
 * For any of these languages whole hosts of tools such as debuggers, IDEs, 
etc. exist and could be used.
 * Users do not have to learn a new language.

There are a few significant drawbacks to this approach:
 * It leads to a quoted programming style which is unnatural and irritating for 
developers.
 * Which scripting language to choose?  Perl, Python, and Ruby all have 
significant adoption and could make a claim to be the best choice.
 * Syntactic and semantic checking is usually delayed until an embedded bit of 
code is reached in the outer control flow.  Given that Pig jobs can run for 
hours this can mean spending hours to discover a simple typo.

Consider, for example, what happens if we build a Python class that wraps 
!PigServer and then translate the above code snippet:

{{{
    error = 100.0
    infile = 'original.data'
    pig = PigServer()
    grunt = Grunt()
    while error > 1.0:
        # Adjacent string literals are concatenated into one query string.
        pig.registerQuery("A = load 'infile'; "
                          "B = group A all; "
                          "C = foreach B generate flatten(doSomeCalculation(A)) "
                          "as (result, error);")
        pi = pig.openIterator("C", 'outfile')
        output = grunt.exec("fs cat 'outfile'")
        # Split on the first tab: (result, "\t", error).
        bla = output.partition("\t")
        error = float(bla[2])
        if error >= 1.0:
            grunt.exec("fs mv 'outfile' 'infile'")
}}}

All of these references to `pig` and `grunt` as objects driven by command 
strings are undesirable.
So while I believe that embedding is a much better approach due to the lower 
work load and the plethora of tools available for other
languages, I do not believe the above is an acceptable way to do it.  Thus I 
would like to place three additional requirements on
embedded Pig Latin beyond those given above for Turing complete Pig Latin:
 1.#7 Pig Latin should appear as a natural part of the language it is embedded 
in, not as quoted strings.
 1. Syntactic and semantic checking should be done up front before the script 
is run.
 1. It should be possible to write UDFs in the scripting language that Pig 
Latin is embedded in and reference them from Pig Latin.

This overcomes two of the three drawbacks noted above.  It does not provide for 
a way to do certain optimizations such as loop
unrolling, but I think that is acceptable.

What might this look like?  Again using the script snippet at the top and 
embedding it in Jython, this might look like:
{{{
    error = 100.0
    infile = 'original.data'
    while error > 1.0:
        PIG:
            A = load '$infile';
            B = group A all;
            C = foreach B generate flatten(doSomeCalculation(A)) as (result, error);
            $error = foreach C generate error;
            store C into 'outfile';
        if error > 1.0:
            PIG:
                fs mv 'outfile' 'infile';
            infile = 'infile';

    def doSomeCalculation(A):
        for x in A:
            ...
}}}

A preprocessor could then be applied to the above that would convert this to a 
form of Jython that the embedding functionality provided as part of
the `pyg.tgz` patch already attached to 
[[https://issues.apache.org/jira/browse/PIG-928|PIG-928]] can run.  In other 
words the above would be
converted into something like the second example that uses the !PigServer 
interface.  This preprocessor could also submit the
Pig Latin portions of the script for syntactic and semantic checking.  
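
As a rough illustration of that conversion step, the sketch below turns the 
statements of one Pig Latin block into a single line of Python source that 
calls `registerQuery` on a wrapper object, as in the second example.  The 
function name `to_register_query` and the default wrapper name `pig` are 
hypothetical, chosen only for this sketch; this is not how the PIG-928 patch 
is known to work.

```python
def to_register_query(statements, indent, server_name="pig"):
    """Build one line of Python source that passes the given Pig Latin
    statements to a wrapper object as a single quoted string.

    `server_name` is the (assumed) name of the wrapper variable in the
    generated script; `indent` is the number of leading spaces to emit.
    """
    # Join the statements into one query string, one space apart.
    query = " ".join(s.strip() for s in statements)
    # %r uses repr(), which yields a valid Python string literal and
    # escapes any quotes that appear inside the Pig Latin.
    return "%s%s.registerQuery(%r)" % (" " * indent, server_name, query)


if __name__ == "__main__":
    print(to_register_query(["A = load 'infile';", "B = group A all;"], 8))
```

Using `repr` to build the literal avoids hand-written quoting rules, which is 
exactly the kind of detail the quoted style pushes onto the user.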

This preprocessor would find the Pig Latin segments of code via the `PIG:` 
tag.  `PIG:` would have the same block scoping rules as
other block operators in Python.  Variables from the Python code would be 
imported to and exported from the Pig Latin via parameter
substitution syntax (e.g. notice how variables `infile` and `error` appear as 
`$infile` and `$error` inside the Pig Latin).
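
A minimal sketch of how such a preprocessor might locate `PIG:` blocks and 
import Python values into them is given below.  It assumes the script is 
indented with spaces only, and it handles just the import direction 
(`$infile`); exporting a result such as `$error` back to Python is not 
covered.  The names `find_pig_blocks` and `substitute` are hypothetical.

```python
import re
from string import Template

# A line consisting only of the PIG: tag, with optional indentation.
PIG_TAG = re.compile(r"^(\s*)PIG:\s*$")


def find_pig_blocks(source):
    """Return a list of (tag_line_index, [statements]) for each PIG: block."""
    lines = source.splitlines()
    blocks = []
    i = 0
    while i < len(lines):
        match = PIG_TAG.match(lines[i])
        if not match:
            i += 1
            continue
        tag_indent = len(match.group(1))
        body = []
        j = i + 1
        # Python-style scoping: the block continues while lines are
        # indented deeper than the PIG: tag.  Blank lines do not end it.
        while j < len(lines):
            stripped = lines[j].strip()
            indent = len(lines[j]) - len(lines[j].lstrip())
            if stripped and indent <= tag_indent:
                break
            if stripped:
                body.append(stripped)
            j += 1
        blocks.append((i, body))
        i = j
    return blocks


def substitute(statement, variables):
    """Replace $name with the Python value; unknown names are left as-is."""
    return Template(statement).safe_substitute(variables)


if __name__ == "__main__":
    script = (
        "while error > 1.0:\n"
        "    PIG:\n"
        "        A = load '$infile';\n"
        "        B = group A all;\n"
        "    if error > 1.0:\n"
        "        PIG:\n"
        "            fs mv 'outfile' 'infile';\n"
    )
    for index, statements in find_pig_blocks(script):
        for statement in statements:
            print(index, substitute(statement, {"infile": "original.data"}))
```

Because every block is collected before any of it runs, the same pass could 
hand each block to a checker up front, which is what the second additional 
requirement asks for.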

The last drawback, which this proposal does not address, is that we have to 
pick a particular
scripting language to embed Pig Latin in.  There are two solutions I can see 
here:
 1. Do our best to pick a good one and live with people's unhappiness.
 1. Write the preprocessor in such a way that it is relatively easy to switch 
languages it would embed Pig Latin into.

While option two sounds nice it greatly complicates the project.  And 
realistically how many people are going to take on all the
work to port it to another language?  Based on this I suggest we go with option 
one and choose Jython as the scripting language.  I
vote for Jython for two reasons.  One, in order to meet the goal of embedding 
functions from the scripting language directly as UDFs
we have to pick a language that compiles to Java bytecode.  That leaves us 
with Jython, JRuby, Groovy, or !JavaScript.  Two, out of that
field we already have half of the implementation we need in Jython with 
[[https://issues.apache.org/jira/browse/PIG-928|PIG-928]]. 
