Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.
The "NativeMapReduce" page has been changed by Aniket Mokashi.
http://wiki.apache.org/pig/NativeMapReduce?action=diff&rev1=2&rev2=3
--------------------------------------------------
<<Navigation(children)>>
<<TableOfContents>>

This document captures the specification for native map reduce jobs and a proposal for executing native mapreduce jobs inside a pig script. This is tracked at [[#ref1|PIG-506]].

== Introduction ==
Pig needs to provide a way to natively run map reduce jobs written in the Java language. There are some advantages of this:

 1. With the ''native'' keyword, the user need not worry about coordination between the jobs; pig takes care of it.
 2. Users can make use of existing Java applications without being Java programmers.

{{{
Y = NATIVE ('mymr.jar' [, 'other.jar' ...]) STORE X INTO 'storeLocation' USING storeFunc LOAD 'loadLocation' USING loadFunc [params, ... ];
}}}

This stores '''X''' into '''storeLocation''' using '''storeFunc''', which is then used by the native mapreduce job to read its data. After we run mymr.jar's mapreduce, we load the data back from '''loadLocation''' into alias '''Y''' using '''loadFunc'''.

params are extra parameters required for the native mapreduce job (TBD).

mymr.jar must be compliant with the pig specification (see below).
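For illustration, a complete script using the proposed syntax might look as follows. This is a hypothetical sketch: the file names, jar name, schemas, and the assumption that mymr.jar implements a wordcount job are illustrative, not part of the specification.

{{{
A = LOAD 'webcrawl' USING PigStorage() AS (url: chararray, pagecontent: chararray);
-- run the native wordcount job packaged in mymr.jar; pig stores A into
-- the job's input directory and loads the job's output back into B
B = NATIVE ('mymr.jar') STORE A INTO 'inputDir' USING PigStorage()
    LOAD 'outputDir' USING PigStorage() AS (word: chararray, count: long);
C = ORDER B BY count DESC;
STORE C INTO 'topwords';
}}}

Note that pig handles the coordination: the store into 'inputDir' is guaranteed to complete before mymr.jar runs, and the load from 'outputDir' happens only after the native job finishes.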
== Comparison with similar features ==
=== Pig Streaming ===
The purpose of [[#ref2|pig streaming]] is to send data through an external script or program, transforming a dataset into a different dataset with a custom script written in any programming/scripting language. Pig streaming uses hadoop streaming support to achieve this. Pig can register custom programs in a script, inline in the stream clause or using a define clause. Pig also provides a level of data guarantees on the data processing, provides features for job management, and provides the ability to use the distributed cache for the scripts (configurable). Streaming applications run locally on individual mapper and reducer nodes.

=== Hive Transforms ===
With [[#ref3|hive transforms]], users can also plug their own custom mappers and reducers into the data stream. Basically, this is another application of the custom streaming supported by hadoop. These mappers and reducers can be written in any scripting language and can be registered to the distributed cache to help performance. To support custom map reduce programs written in Java ([[#ref4|Bizo's blog]]), we can use our custom mappers and reducers as data streaming functions and use them to transform the data using 'java -cp mymr.jar'. This will not invoke a map reduce task but will transform the data during the map or the reduce task (locally).

Thus, both of these features can transform the data submitted to a map reduce job (mapper) into a different data set and/or transform the data produced by a mapreduce job (reducer) into a different data set. But we should note that the data transformation takes place on a single machine and does not take advantage of the map reduce framework itself. Also, these constructs only allow custom transformations inside the data pipeline and do not break the pipeline.
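To make the contrast concrete, the streaming approach described above could be expressed in pig roughly as follows (the class name com.example.MyMapper and the file names are hypothetical). Note that MyMapper runs here as a local per-node transformation inside the map task, not as a separate mapreduce job:

{{{
DEFINE java_mapper `java -cp mymr.jar com.example.MyMapper` SHIP('mymr.jar');
A = LOAD 'input' USING PigStorage();
B = STREAM A THROUGH java_mapper;
STORE B INTO 'output';
}}}

The proposed NATIVE keyword differs precisely here: instead of piping tuples through a local process, it would hand the whole data set to mymr.jar as a full mapreduce job.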
With native job support, pig can support native map reduce jobs written in the Java language that convert a data set into a different data set after applying a custom map reduce function of any complexity.

== Native Mapreduce job specification ==
A native mapreduce job needs to conform to a specification defined by Pig. Pig specifies the input and output directories for this job in the script and is responsible for managing the coordination of the native job with the remaining pig mapreduce jobs. To allow pig to communicate with the native map reduce job:

 1. Ordered inputLoc/outputLoc parameters-
 2. getJobConf Function-

== Implementation Details ==

== References ==
 1. <<Anchor(ref1)>> PIG-506, "Does pig need a NATIVE keyword?", https://issues.apache.org/jira/browse/PIG-506
 2. <<Anchor(ref2)>> Pig Wiki, "Pig Streaming Functional Specification", http://wiki.apache.org/pig/PigStreamingFunctionalSpec
 3. <<Anchor(ref3)>> Hive Wiki, "Transform/Map-Reduce Syntax", http://wiki.apache.org/hadoop/Hive/LanguageManual/Transform
 4. <<Anchor(ref4)>> Bizo blog, "hive map reduce in java", http://dev.bizo.com/2009/10/hive-map-reduce-in-java.html