David Ciemiewicz commented on PIG-506:
This seems like a much cleaner way to set up native Hadoop map-reduce jobs than
the command line interfaces people use today. Might be worth it just for that.
I think you'd need to gather some examples from non-Pig users and prototype
them as Pig/NATIVE scripts to demonstrate what the advantages would be.
For me, as a primary Pig user, there is some appeal because I could benefit
from borrowing others' code.
> Does pig need a NATIVE keyword?
> Key: PIG-506
> URL: https://issues.apache.org/jira/browse/PIG-506
> Project: Pig
> Issue Type: New Feature
> Components: impl
> Reporter: Alan Gates
> Assignee: Alan Gates
> Priority: Minor
> Assume a user had a job that broke easily into three pieces. Further assume
> that pieces one and three were easily expressible in pig, but that piece two
> needed to be written in map reduce for whatever reason (performance,
> something that pig could not easily express, legacy job that was too
> important to change, etc.). Today the user would either have to use map
> reduce for the entire job or manually handle the stitching together of pig
> and map reduce jobs. What if instead pig provided a NATIVE keyword that
> would allow the script to pass off the data stream to the underlying system
> (in this case map reduce). The semantics of NATIVE would vary by underlying
> system. In the map reduce case, we would assume that this indicated a
> collection of one or more fully contained map reduce jobs, so that pig would
> store the data, invoke the map reduce jobs, and then read the resulting data
> to continue. It might look something like this:
> A = load 'myfile';
> X = load 'myotherfile';
> B = group A by $0;
> C = foreach B generate group, myudf(B);
> D = native (jar=mymr.jar, infile=frompig, outfile=topig);
> E = join D by $0, X by $0;
> This differs from streaming in that it allows the user to insert an arbitrary
> amount of native processing, whereas streaming allows the insertion of one
> binary. It also differs in that, for streaming, data is piped directly into
> and out of the binary as part of the pig pipeline. Here the pipeline would
> be broken: data would be written to disk, the native block invoked, and the
> resulting data read back from disk.
> Another alternative is to say this is unnecessary because the user can do the
> coordination from java, using the PigServer interface to run pig and calling
> the map reduce job explicitly. The advantages of the native keyword are that
> the user need not worry about coordination between the jobs; pig will
> take care of it. Also, the user can make use of existing java applications
> without being a java programmer.
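The store / invoke / load contract described in the issue could be sketched as
follows. This is only an illustration of the proposed semantics, not Pig's
actual API; every name here (run_native_block, native_job, the frompig/topig
file names) is hypothetical, and an ordinary function stands in for the
user's map-reduce jobs.

```python
import os
import tempfile

def store(records, path):
    """Stand-in for pig storing the pipeline's data to disk."""
    with open(path, "w") as f:
        for rec in records:
            f.write("\t".join(map(str, rec)) + "\n")

def load(path):
    """Stand-in for pig reading the native block's output back."""
    with open(path) as f:
        return [line.rstrip("\n").split("\t") for line in f]

def native_job(infile, outfile):
    """Stand-in for a fully contained native job: here it just
    upper-cases the first field of every record."""
    with open(infile) as fin, open(outfile, "w") as fout:
        for line in fin:
            fields = line.rstrip("\n").split("\t")
            fields[0] = fields[0].upper()
            fout.write("\t".join(fields) + "\n")

def run_native_block(records, job):
    """Pig's side of NATIVE: break the pipeline, write the data out,
    invoke the native block, then read the result back in."""
    tmpdir = tempfile.mkdtemp()
    infile = os.path.join(tmpdir, "frompig")
    outfile = os.path.join(tmpdir, "topig")
    store(records, infile)
    job(infile, outfile)  # in real pig this would launch the MR jobs
    return load(outfile)

result = run_native_block([("foo", 1), ("bar", 2)], native_job)
# result == [["FOO", "1"], ["BAR", "2"]]
```

The key point the sketch shows is that, unlike streaming, nothing is piped:
the pipeline halts at the store, the native block runs to completion against
the files, and the pipeline resumes from the load.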
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.