Hi,

We got some requirements from a Hadoop Streaming user who would like to see streaming available in Pig. Here are his requirements:

================================================================

Features that will help to make Pig the standard, preferred way to use Hadoop.
We need to look at three sets of features:
-- immediate added value -- things that become easy (e.g. can calculate the total deposit amount)
-- cannot do without (e.g. have to do OCR in order to get the deposit amount)
-- real added value -- things that are very hard or impossible to do otherwise

The "immediate added value" is:
-- to be able to fully specify a Hadoop-based job (probably consisting of multiple steps) in a single script, without writing many redundant and cryptic character sequences.
-- the data processing algorithms themselves are expressed in any language of the user's choice, without any adaptation to Hadoop.

The "cannot do without" features are the things that Hadoop Streaming users are already using -- either provided by the current infrastructure, or by tools and hacks they have put together themselves.

1.
-- a simple streaming program does not use any extra concepts
-- specifying a sequence of steps in one script (mostly there)
-- input and output data set (directory) names should not need to be hard-coded in the Pig script. It should be possible to combine configuration parameters to define the input and working directory names (DFS) -- see the first sketch at the end of this message
-- error checking -- stop the execution if a step failed (may be non-trivial in the case of streaming, as a streaming step may have its own way to indicate a failure)
-- meaningful but overridable defaults (a lot of specifics -- needs a separate discussion)

2. "stderr"
-- available before the job (the step) has completed
-- configurable -- by default only some standard environment summary (command, input name, available disk space, start & end time, number of processed records) plus all the user stderr goes there
-- available in DFS (the name has a useful default -- e.g. based on the name of the output)
-- (advanced) deliver the error messages -- e.g. syntax errors -- to the client

3. Map/reduce command
-- may be a command line (not just an "executable")
-- command line parameters may be calculated within the Pig script; in particular, they may come from the command line that invoked the Pig script
-- the command may have multiple levels of quotes and other special characters
-- the step can be defined by an existing Java class, rather than by a Unix command

4. Input
-- allow getting the input without any transformation
-- allow using any existing InputFormat class as the input transformation
-- the program may require taking its input from a named file (rather than stdin)
-- secondary sort

5. Output
-- the program may write its main output to a named "file", rather than to standard output
-- secondary output
-- sorted output in a single file (see the second sketch at the end of this message)

6. Files
-- the files that should be shipped together with the executables for a given step can be specified inside the Pig script

REAL ADDED VALUE FEATURES

7. Efficient joins -- for all important variations of join (see the third sketch at the end of this message)
8. Metadata about files (schema)
9. Support for counters
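To make the "immediate added value" and items 1-3 and 6 concrete, here is a minimal first sketch of what such a script might look like. It uses Pig Latin streaming-style syntax (DEFINE ... SHIP ... STDERR, STREAM ... THROUGH) plus parameter substitution; all script, file, and parameter names are hypothetical, and the exact clauses should be read as an assumption about the eventual design, not a commitment.

    -- invoked as, e.g.:
    --   pig -param input=/data/deposits -param workdir=/user/joe/run1 deposits.pig
    -- so input and working directory names are not hard-coded (item 1)

    -- step commands are full command lines in any language, with supporting
    -- files shipped alongside them (items 3 and 6) and per-step stderr kept
    -- in a DFS location with a sensible default-style name (item 2)
    DEFINE parse `parse_deposits.pl --ocr`
        SHIP('parse_deposits.pl')
        STDERR('$workdir/parse.err' LIMIT 100);
    DEFINE total `total_deposits.py`
        SHIP('total_deposits.py');

    raw    = LOAD '$input';                -- input taken without transformation (item 4)
    parsed = STREAM raw THROUGH parse;     -- step 1
    totals = STREAM parsed THROUGH total;  -- step 2: a sequence of steps in one script
    STORE totals INTO '$workdir/totals';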
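For item 5's "sorted output in a single file", a second sketch under the same assumptions: forcing the final sort onto a single reducer yields one globally sorted output file.

    -- hypothetical continuation of the script above
    totals = LOAD '$workdir/totals' AS (branch, amount);
    sorted = ORDER totals BY amount DESC PARALLEL 1;  -- one reducer => one sorted file
    STORE sorted INTO '$workdir/totals_sorted';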

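Finally, for item 7, a third sketch of one important join variation: a map-side (fragment-replicate) join, usable when one input is small enough to hold in memory. Relation and path names are hypothetical, and the 'replicated' hint is only an assumption about how a join strategy might be selected in the script.

    deposits = LOAD '$workdir/totals' AS (branch, amount);
    branches = LOAD '/data/branches' AS (branch, region);
    joined   = JOIN deposits BY branch, branches BY branch USING 'replicated';
    STORE joined INTO '$workdir/by_region';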