Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.
The following page has been changed by Arun C Murthy:
http://wiki.apache.org/pig/PigStreamingDesign

------------------------------------------------------------------------------
  == 4. Streaming Implementation ==

- The main infrastructure to support the notion of data processing by sending dataset to a task's input and collecting its output is a generic manager who takes care of setup/teardown of the streaming task, manages it's stdin/stderr/stdout streams and also does post-processing. The plan is to implement a {{{org.apache.hadoop.mapred.lib.external.ExecutableManager}}} to take over the aforementioned responsibilities. The decision to keep that separate from Hadoop's Streaming component (in contrib/streaming) to ensure that Pig has no extraneous dependency on Hadoop streaming and am putting it into org.apache.hadoop.mapred.lib to ensure Pig depends on Hadoop Core only.
+ The main infrastructure supporting data processing by sending a dataset to a task's input and collecting its output is a generic manager that takes care of setup/teardown of the streaming task, manages its stdin/stderr/stdout streams and also does post-processing. The plan is to implement a {{{org.apache.pig.backend.streaming.PigExecutableManager}}} to take over the aforementioned responsibilities. It is kept separate from Hadoop's Streaming component (in contrib/streaming) to ensure that Pig has no extraneous dependency on Hadoop streaming.

  The `ExecutableManager` is also responsible for dealing with multiple outputs of the streaming tasks (refer to the functional spec in the wiki).

  New:
- `org.apache.hadoop.mapred.lib.external.ExecutableManager`
+ `org.apache.pig.backend.streaming.PigExecutableManager`

  {{{
- class org.apache.hadoop.mapred.lib.external.ExecutableManager {
+ class PigExecutableManager {
+
+   // Configure the executable-manager
    void configure() throws IOException;
+
+   // Runtime
    void run() throws IOException;
+
+   // Clean-up hook (e.g. handling of multiple outputs)
    void close() throws IOException;
-
-   void next(WritableComparable key, Writable value);
+
+   // Send the Datum to the executable
+   void add(Datum d);
  }
  }}}

- The important deviation from current Pig infrastructure is that there isn't a one-to-one mapping between inputs and output records anymore since the user-script could (potentially) consume ''all'' the input before it emits ''any'' output records. The way to get around this is to wrap the {{{DataCollector}}} and hence the next successor in the pipleline in an {{{OutputCollector}}} and pass it along to the {{{ExecutableManager}}}.
+ The important deviation from current Pig infrastructure is that there is no longer a one-to-one mapping between input and output records, since the user script could (potentially) consume ''all'' the input before it emits ''any'' output records. The way around this is to wrap the {{{DataCollector}}}, and hence the next successor in the pipeline, in an {{{OutputCollector}}} and pass it along to the {{{PigExecutableManager}}}.
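
The configure/run/add/close life cycle and the loss of a one-to-one input/output mapping can be illustrated with a small stand-alone sketch. This is not the actual Pig class: `SketchExecutableManager`, `StreamingSketch` and `runThrough` are hypothetical names, records are plain strings rather than `Datum` objects, a `Consumer<String>` stands in for the wrapped collector, and `add` is allowed to throw `IOException` here, unlike the signature shown in the design. It assumes a Unix-like system with `sort` on the PATH.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of an executable manager; not the real PigExecutableManager.
class SketchExecutableManager {
    private final List<String> command;
    private final Consumer<String> collector; // stand-in for the wrapped collector
    private Process process;
    private BufferedWriter stdin;
    private Thread stdoutReader;
    private volatile IOException readError;

    SketchExecutableManager(List<String> command, Consumer<String> collector) {
        this.command = command;
        this.collector = collector;
    }

    // Configure: this sketch only validates the command line.
    void configure() throws IOException {
        if (command.isEmpty()) {
            throw new IOException("no command given");
        }
    }

    // Runtime: start the process and a thread that drains its stdout.
    void run() throws IOException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectError(ProcessBuilder.Redirect.INHERIT); // pass stderr through
        process = builder.start();
        stdin = new BufferedWriter(
                new OutputStreamWriter(process.getOutputStream(), StandardCharsets.UTF_8));
        stdoutReader = new Thread(() -> {
            try (BufferedReader out = new BufferedReader(
                    new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                // Output arrives whenever the executable decides to emit it,
                // independently of how many records have been added so far.
                while ((line = out.readLine()) != null) {
                    collector.accept(line);
                }
            } catch (IOException e) {
                readError = e;
            }
        });
        stdoutReader.start();
    }

    // Send one record to the executable's stdin.
    void add(String record) throws IOException {
        stdin.write(record);
        stdin.newLine();
    }

    // Clean-up hook: signal end of input, then wait for the process and the reader.
    void close() throws IOException {
        stdin.close();
        try {
            int exitCode = process.waitFor();
            stdoutReader.join();
            if (exitCode != 0) {
                throw new IOException("executable exited with code " + exitCode);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException("interrupted while waiting for executable", e);
        }
        if (readError != null) {
            throw readError;
        }
    }
}

class StreamingSketch {
    // Feed all input records through the command and return the collected output lines.
    static List<String> runThrough(List<String> command, List<String> input) {
        List<String> output = new ArrayList<>();
        SketchExecutableManager manager = new SketchExecutableManager(command, output::add);
        try {
            manager.configure();
            manager.run();
            for (String record : input) {
                manager.add(record);
            }
            manager.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(runThrough(Arrays.asList("sort"), Arrays.asList("b", "a", "c")));
    }
}
```

With `sort` as the executable, nothing can be emitted until stdin is closed in `close()`, which is the "consume all input before emitting any output" case. That is why the sketch hands the collector to the manager up front instead of returning one output per `add` call.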