Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "NativeMapReduce" page has been changed by Aniket Mokashi.
http://wiki.apache.org/pig/NativeMapReduce?action=diff&rev1=2&rev2=3

--------------------------------------------------

  <<Navigation(children)>>
  <<TableOfContents>>
  
- This document captures the specification for native map reduce jobs and 
proposal for executing native mapreduce jobs inside pig script. This is tracked 
at [[#ref1|Jira]].
+ This document captures the specification for native map reduce jobs and a proposal for executing native map reduce jobs inside a Pig script. This is tracked at [[#ref1|PIG-506]].
  
  == Introduction ==
- Pig needs to provide a way to natively run map reduce jobs written in java 
language. 
+ Pig needs to provide a way to natively run map reduce jobs written in Java.
  There are several advantages to this:
   1. With the ''native'' keyword, the user need not worry about coordination between the jobs; Pig takes care of it.
   2. Users can make use of existing Java map reduce applications without themselves being Java programmers.
@@ -24, +24 @@

  Y = NATIVE ('mymr.jar' [, 'other.jar' ...]) STORE X INTO 'storeLocation' 
USING storeFunc LOAD 'loadLocation' USING loadFunc [params, ... ];
  }}}
  
- This stores '''X''' into the '''storeLocation''' which is used by native 
mapreduce to read its data. After we run mymr.jar's mapreduce we load back the 
data from '''loadLocation''' into alias '''Y'''.
+ This stores '''X''' into '''storeLocation''' using '''storeFunc''', which is then used by the native map reduce job to read its data. After running mymr.jar's map reduce job, we load the data back from '''loadLocation''' into alias '''Y''' using '''loadFunc'''.
+ 
+ '''params''' are extra parameters required for the native map reduce job (TBD).
+ 
+ mymr.jar must be compliant with the Pig specification (see below).
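+ 
+ For example, a native job packaged as a hypothetical wordcount.jar could be run in the middle of a Pig script as sketched below. The jar name, paths, and load/store functions are illustrative assumptions, not part of the specification:
+ 
+ {{{
+ A = LOAD 'pages' USING TextLoader();
+ -- store A where the native job expects its input, run the job,
+ -- then load the job's output back into the pipeline as B
+ B = NATIVE ('wordcount.jar') STORE A INTO '/tmp/wc_input' USING PigStorage() LOAD '/tmp/wc_output' USING PigStorage();
+ STORE B INTO 'wordcounts';
+ }}}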
  
  == Comparison with similar features ==
  === Pig Streaming ===
+ The purpose of [[#ref2|pig streaming]] is to send data through an external script or program, transforming a dataset into a different dataset using a custom script written in any programming or scripting language. Pig streaming uses Hadoop streaming support to achieve this. Pig can register custom programs in a script, either inline in the STREAM clause or using a DEFINE clause. Pig also provides a level of data guarantees on the data processing, features for job management, and the ability to ship the scripts via the distributed cache (configurable). Streaming applications run locally on the individual mapper and reducer nodes.
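+ 
+ For comparison, a typical streaming invocation looks like the following sketch (the script name is illustrative):
+ 
+ {{{
+ DEFINE cmd `perl myscript.pl` ship('myscript.pl');
+ A = LOAD 'input';
+ -- each map or reduce task pipes its portion of A through the local script
+ B = STREAM A THROUGH cmd;
+ STORE B INTO 'output';
+ }}}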
  
- === Hive Transform ===
+ === Hive Transforms ===
+ With [[#ref3|hive transforms]], users can also plug their own custom mappers and reducers into the data stream. Essentially, this is another application of the custom streaming supported by Hadoop: the mappers and reducers can be written in any scripting language and can be registered in the distributed cache to help performance. To support custom map reduce programs written in Java ([[#ref4|Bizo's blog]]), we can use our custom mappers and reducers as data streaming functions and use them to transform the data with 'java -cp mymr.jar'. This will not invoke a map reduce task but will attempt to transform the data during the map or the reduce task (locally).
+ 
+ Thus, both these features can transform the data submitted to a map reduce job (mapper) into a different data set and/or transform the data produced by a map reduce job (reducer) into a different data set. But we should note that the data transformation takes place on a single machine and does not take advantage of the map reduce framework itself. Also, these constructs only allow custom transformations inside the data pipeline and do not break the pipeline.
+ 
+ With native job support, Pig can support native map reduce jobs written in Java that convert a data set into a different data set after applying a custom map reduce function of any complexity.
  
  == Native Mapreduce job specification ==
- Native Mapreduce job needs to conform to some specification defined by Pig. 
Pig specifies the input and output directory for this job and is responsible 
for 
+ A native map reduce job needs to conform to a specification defined by Pig. Pig specifies the input and output directories for this job in the script and is responsible for managing the coordination of the native job with the remaining Pig map reduce jobs. To allow Pig to communicate with the native map reduce job, the job must provide:
+ 
+  1. Ordered inputLoc/outputLoc parameters -
+ 
+  2. getJobConf function -
  
  == Implementation Details ==
  
@@ -42, +54 @@

   1. <<Anchor(ref1)>> PIG-506, "Does pig need a NATIVE keyword?", 
https://issues.apache.org/jira/browse/PIG-506
   2. <<Anchor(ref2)>> Pig Wiki, "Pig Streaming Functional Specification", 
http://wiki.apache.org/pig/PigStreamingFunctionalSpec
   3. <<Anchor(ref3)>> Hive Wiki, "Transform/Map-Reduce Syntax", 
http://wiki.apache.org/hadoop/Hive/LanguageManual/Transform
+  4. <<Anchor(ref4)>> Bizo blog, "hive map reduce in java", http://dev.bizo.com/2009/10/hive-map-reduce-in-java.html
  
