David Ciemiewicz commented on PIG-506:


This seems like a much cleaner way to set up native Hadoop map-reduce jobs than the 
command-line interfaces people use today.  It might be worth it just for that.

I think you'd need to gather some examples from non-Pig users and prototype 
them as Pig/NATIVE scripts to demonstrate what the advantages would be.

For me, as a primarily Pig user, there is some appeal because I could benefit 
from borrowing others' code.

> Does pig need a NATIVE keyword?
> -------------------------------
>                 Key: PIG-506
>                 URL: https://issues.apache.org/jira/browse/PIG-506
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>            Priority: Minor
> Assume a user had a job that broke easily into three pieces.  Further assume 
> that pieces one and three were easily expressible in pig, but that piece two 
> needed to be written in map reduce for whatever reason (performance, 
> something that pig could not easily express, legacy job that was too 
> important to change, etc.).  Today the user would either have to use map 
> reduce for the entire job or manually handle the stitching together of pig 
> and map reduce jobs.  What if instead pig provided a NATIVE keyword that 
> would allow the script to pass off the data stream to the underlying system 
> (in this case map reduce).  The semantics of NATIVE would vary by underlying 
> system.  In the map reduce case, we would assume that this indicated a 
> collection of one or more fully contained map reduce jobs, so that pig would 
> store the data, invoke the map reduce jobs, and then read the resulting data 
> to continue.  It might look something like this:
> {code}
> A = load 'myfile';
> X = load 'myotherfile';
> B = group A by $0;
> C = foreach B generate group, myudf(B);
> D = native (jar=mymr.jar, infile=frompig, outfile=topig);
> E = join D by $0, X by $0;
> ...
> {code}
> This differs from streaming in that it allows the user to insert an arbitrary 
> amount of native processing, whereas streaming allows the insertion of one 
> binary.  It also differs in that, for streaming, data is piped directly into 
> and out of the binary as part of the pig pipeline.  Here the pipeline would 
> be broken, data written to disk, and the native block invoked, then data read 
> back from disk.
> Another alternative is to say this is unnecessary because the user can do the 
> coordination from java, using the PigServer interface to run pig and calling 
> the map reduce job explicitly.  The advantages of the native keyword are that 
> the user need not worry about coordination between the jobs; pig will 
> take care of it.  Also, the user can make use of existing java applications 
> without being a java programmer.
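For comparison, the hand-coordination alternative described above might look roughly like the sketch below. This is only an illustration under assumptions: the job name, the `frompig`/`topig`/`finalout` paths, and the use of `IdentityMapper`/`IdentityReducer` as stand-ins for the user's real native job are hypothetical; the `PigServer` and old `org.apache.hadoop.mapred` calls are the real APIs.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class ManualCoordination {
    public static void main(String[] args) throws IOException {
        // Step 1: run the first pig fragment and store its output where
        // the native job expects its input ('frompig' is a made-up path).
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        pig.registerQuery("A = load 'myfile';");
        pig.registerQuery("B = group A by $0;");
        pig.registerQuery("C = foreach B generate group, myudf(B);");
        pig.store("C", "frompig");

        // Step 2: invoke the native map-reduce job explicitly.  Identity
        // mapper/reducer stand in for the user's real classes from mymr.jar.
        JobConf conf = new JobConf(ManualCoordination.class);
        conf.setJobName("native-step");
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        FileInputFormat.setInputPaths(conf, new Path("frompig"));
        FileOutputFormat.setOutputPath(conf, new Path("topig"));
        JobClient.runJob(conf);

        // Step 3: read the native job's output back into pig and continue
        // with the rest of the script.
        pig.registerQuery("D = load 'topig';");
        pig.registerQuery("X = load 'myotherfile';");
        pig.registerQuery("E = join D by $0, X by $0;");
        pig.store("E", "finalout");
    }
}
```

The point of the proposal is that all of this glue (storing to a temp location, launching the job, reloading) would collapse into the single `native` statement in the script.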

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
