[Pig Wiki] Update of "PigStreamingDesign" by Arun C Murthy

Apache Wiki Mon, 03 Mar 2008 23:15:06 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.


The following page has been changed by Arun C Murthy:
http://wiki.apache.org/pig/PigStreamingDesign

------------------------------------------------------------------------------
  
  == 4. Streaming Implementation ==
  
- The main infrastructure to support the notion of data processing by sending 
dataset to a task's input and collecting its output is a generic manager who 
takes care of setup/teardown of the streaming task, manages it's 
stdin/stderr/stdout streams and also does post-processing. The plan is to 
implement a {{{org.apache.hadoop.mapred.lib.external.ExecutableManager}}} to 
take over the aforementioned responsibilities. The decision to keep that 
separate from Hadoop's Streaming component (in contrib/streaming) to ensure 
that Pig has no extraneous dependency on Hadoop streaming and am putting it 
into org.apache.hadoop.mapred.lib to ensure Pig depends on Hadoop Core only.
+ The main infrastructure for processing data by sending a dataset to a task's 
input and collecting its output is a generic manager that takes care of 
setup/teardown of the streaming task, manages its stdin/stderr/stdout streams, 
and also does post-processing. The plan is to implement a 
{{{org.apache.pig.backend.streaming.PigExecutableManager}}} to take over the 
aforementioned responsibilities. It is kept separate from Hadoop's Streaming 
component (in contrib/streaming) to ensure that Pig has no extraneous 
dependency on Hadoop streaming.
  
  The `ExecutableManager` is also responsible for handling the multiple outputs 
of the streaming task (refer to the functional spec in the wiki).
  
  New:
-   `org.apache.hadoop.mapred.lib.external.ExecutableManager`
+   `org.apache.pig.backend.streaming.PigExecutableManager`
  
  {{{
-   class org.apache.hadoop.mapred.lib.external.ExecutableManager {
+   class PigExecutableManager {
+ 
+     // Configure the executable-manager
      void configure() throws IOException;
+ 
+     // Runtime
      void run() throws IOException;
+ 
+     // Clean-up hook (e.g. handling of multiple outputs)
      void close() throws IOException;
-     void next(WritableComparable key, Writable value);
+ 
+     // Send the Datum to the executable
+     void add(Datum d);
    }
  }}}
  
- The important deviation from current Pig infrastructure is that there isn't a 
one-to-one mapping between inputs and output records anymore since the 
user-script could (potentially) consume ''all'' the input before it emits 
''any'' output records. The way to get around this is to wrap the 
{{{DataCollector}}} and hence the next successor in the pipleline in an 
{{{OutputCollector}}} and pass it along to the {{{ExecutableManager}}}.
+ The important deviation from the current Pig infrastructure is that there is 
no longer a one-to-one mapping between input and output records, since the 
user script could (potentially) consume ''all'' the input before it emits 
''any'' output records. The way around this is to wrap the 
{{{DataCollector}}}, and hence the next successor in the pipeline, in an 
{{{OutputCollector}}} and pass it along to the {{{PigExecutableManager}}}.
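
To make the decoupling of input and output records concrete, here is a minimal Java sketch. It is not the planned {{{PigExecutableManager}}}: the class {{{SketchExecutableManager}}}, the helper {{{streamThrough}}}, the use of plain text lines in place of {{{Datum}}}, and the use of a {{{Consumer<String>}}} in place of the wrapped {{{OutputCollector}}} are all assumptions made for illustration. It also assumes a POSIX {{{sort}}} binary is on the PATH, chosen because it consumes all of its input before emitting any output.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

class StreamingSketch {

    /** Hypothetical stand-in for PigExecutableManager; records are plain text lines. */
    static class SketchExecutableManager {
        private final List<String> command;
        private final Consumer<String> collector; // stands in for the wrapped OutputCollector
        private Process process;
        private BufferedWriter toChild;
        private Thread stdoutReader;
        private volatile IOException readError;

        SketchExecutableManager(List<String> command, Consumer<String> collector) {
            this.command = command;
            this.collector = collector;
        }

        // Set-up: launch the executable and open its stdin.
        void configure() throws IOException {
            ProcessBuilder builder = new ProcessBuilder(command);
            builder.redirectError(ProcessBuilder.Redirect.INHERIT); // pass stderr through
            process = builder.start();
            toChild = new BufferedWriter(
                new OutputStreamWriter(process.getOutputStream(), StandardCharsets.UTF_8));
        }

        // Runtime (assumed meaning): drain stdout on its own thread, so output
        // records reach the collector whenever the executable chooses to emit them.
        void run() {
            stdoutReader = new Thread(() -> {
                try (BufferedReader fromChild = new BufferedReader(
                        new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
                    String line;
                    while ((line = fromChild.readLine()) != null) {
                        collector.accept(line);
                    }
                } catch (IOException e) {
                    readError = e;
                }
            });
            stdoutReader.start();
        }

        // Send one record to the executable; nothing is returned here because
        // there is no one-to-one mapping between input and output records.
        void add(String record) throws IOException {
            toChild.write(record);
            toChild.write('\n');
        }

        // Clean-up: signal end of input, then wait for all remaining output.
        void close() throws IOException {
            toChild.close();
            try {
                stdoutReader.join();
                int exitCode = process.waitFor();
                if (exitCode != 0) {
                    throw new IOException("executable exited with code " + exitCode);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException("interrupted while waiting for executable", e);
            }
            if (readError != null) {
                throw readError;
            }
        }
    }

    /** Hypothetical helper: push records through a command and gather its output lines. */
    static List<String> streamThrough(List<String> command, List<String> records)
            throws IOException {
        List<String> collected = Collections.synchronizedList(new ArrayList<>());
        SketchExecutableManager manager = new SketchExecutableManager(command, collected::add);
        manager.configure();
        manager.run();
        for (String record : records) {
            manager.add(record);
        }
        manager.close();
        return new ArrayList<>(collected);
    }

    public static void main(String[] args) throws IOException {
        // prints [a, b, c]
        System.out.println(streamThrough(Arrays.asList("sort"), Arrays.asList("b", "c", "a")));
    }
}
```

Because {{{add}}} only writes to the executable's stdin and never waits for a matching output record, the collector may see nothing until {{{close}}} signals end of input, which is exactly the behaviour {{{sort}}} exhibits in this sketch.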
