[jira] Updated: (PIG-480) PERFORMANCE: Use identity mapper in a chain of M-R jobs

Ying He (JIRA) Thu, 03 Dec 2009 12:19:46 -0800

     [ 
https://issues.apache.org/jira/browse/PIG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ying He updated PIG-480:
------------------------

    Attachment: PIG_480.patch

Patch to use the identity map.

An IdentityMapOptimizer is applied when an MR plan contains at least 2 MR jobs.
It evaluates each MR job: if its reducer uses POStore to write a tmp file, and
the mapper of the next MR job contains only a POLoad that loads that tmp file
and a POLocalRearrange, then the POLocalRearrange of the next mapper is moved
up into the reducer of this MR job, and the mapper of the next MR job is
changed to use the identity map.
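
The rewrite rule described above can be sketched on a toy model. This is not
the code in PIG_480.patch: the Job class, canRewrite and optimize are
hypothetical names, operators are represented only by their names, and the
check that the POLoad reads the same tmp file the POStore wrote is omitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class IdentityMapRewriteSketch {
    // Hypothetical stand-in for one MR job: operator names in plan order.
    static final class Job {
        final List<String> map;
        final List<String> reduce;
        boolean identityMap = false;

        Job(List<String> map, List<String> reduce) {
            this.map = new ArrayList<>(map);
            this.reduce = new ArrayList<>(reduce);
        }
    }

    // True when cur's reducer ends in POStore and next's mapper is only
    // POLoad followed by POLocalRearrange.
    static boolean canRewrite(Job cur, Job next) {
        boolean storesTmp = !cur.reduce.isEmpty()
                && cur.reduce.get(cur.reduce.size() - 1).equals("POStore");
        boolean trivialMap =
                next.map.equals(Arrays.asList("POLoad", "POLocalRearrange"));
        return storesTmp && trivialMap;
    }

    // Applies the rewrite to every adjacent pair; returns how many matched.
    static int optimize(List<Job> chain) {
        int rewritten = 0;
        for (int i = 0; i + 1 < chain.size(); i++) {
            Job cur = chain.get(i);
            Job next = chain.get(i + 1);
            if (!canRewrite(cur, next)) {
                continue;
            }
            // Move the rearrange into this reducer, just before the store.
            cur.reduce.add(cur.reduce.size() - 1, "POLocalRearrange");
            // The next mapper no longer runs operators of its own.
            next.map.clear();
            next.identityMap = true;
            rewritten++;
        }
        return rewritten;
    }

    public static void main(String[] args) {
        List<Job> chain = new ArrayList<>();
        chain.add(new Job(Arrays.asList("POLoad", "POLocalRearrange"),
                Arrays.asList("POPackage", "POForEach", "POStore")));
        chain.add(new Job(Arrays.asList("POLoad", "POLocalRearrange"),
                Arrays.asList("POPackage", "POStore")));
        System.out.println("rewritten pairs: " + optimize(chain));
        System.out.println("first reducer: " + chain.get(0).reduce);
    }
}
```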

In this case, the reducer of the MR job outputs (key, tuple) pairs to the tmp
file using a different OutputFormat, PigBinaryValueOutputFormat. It uses a
different record writer to write the data. The record format is:

delimiter (3 bytes: 0x01, 0x02, 0x03)
key
length of byte[] for tuple
byte[] for tuple
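
As an illustration of this layout, here is a minimal sketch that writes and
reads one such record. It is not the record writer from the patch. The
encoding of the key (a String written with writeUTF) and the width of the
length field (a 4-byte int) are assumptions, since the description above does
not specify them; the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

class BinaryValueRecordSketch {
    // Delimiter bytes as listed in the record layout.
    static final byte[] DELIMITER = {0x01, 0x02, 0x03};

    // Serializes one record: delimiter, key, tuple length, tuple bytes.
    // Assumption: String key via writeUTF, length as a 4-byte int.
    static byte[] encode(String key, byte[] tuple) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.write(DELIMITER);
            out.writeUTF(key);
            out.writeInt(tuple.length);
            out.write(tuple);
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Parses one record and returns {key, tupleBytes}.
    // The tuple bytes are returned as-is, without deserializing them.
    static Object[] decode(byte[] record) {
        try {
            DataInputStream in =
                    new DataInputStream(new ByteArrayInputStream(record));
            byte[] delim = new byte[DELIMITER.length];
            in.readFully(delim);
            if (!Arrays.equals(delim, DELIMITER)) {
                throw new IOException("bad record delimiter");
            }
            String key = in.readUTF();
            byte[] tuple = new byte[in.readInt()];
            in.readFully(tuple);
            return new Object[] {key, tuple};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] record = encode("id42", new byte[] {10, 20, 30});
        Object[] decoded = decode(record);
        System.out.println(decoded[0] + " -> "
                + Arrays.toString((byte[]) decoded[1]));
    }
}
```

Because the tuple is length-prefixed, a reader can copy it as raw bytes
without understanding its contents.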

The next MR job, which uses the identity map, uses a different InputFormat,
PigBinaryValueInputFormat, which returns a different RecordReader that reads
the data as (key, tuple) pairs, with the tuple kept in byte[] form. The
identity map does nothing except pass the (key, tuple) pairs through and write
them to disk. When the reducer picks them up, the tuple is deserialized for
processing.
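
A rough sketch of that data flow, outside Hadoop: the map step forwards each
(key, byte[]) pair untouched, and decoding first happens on the reduce side.
All names here are hypothetical, and decoding the bytes as a UTF-8 string is
only a stand-in for Pig's tuple deserialization.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class IdentityMapPassThroughSketch {
    // Counts how often tuple bytes are decoded.
    static int decodeCalls = 0;

    // Stand-in for tuple deserialization (assumption: UTF-8 text).
    static String decodeTuple(byte[] raw) {
        decodeCalls++;
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Identity map: emits every (key, bytes) pair unchanged, never decoding.
    static List<Map.Entry<String, byte[]>> identityMap(
            List<Map.Entry<String, byte[]>> input) {
        List<Map.Entry<String, byte[]>> out = new ArrayList<>(input.size());
        for (Map.Entry<String, byte[]> rec : input) {
            out.add(rec);
        }
        return out;
    }

    // Reduce side: the first place where tuple bytes are decoded.
    static List<String> reduce(List<Map.Entry<String, byte[]>> shuffled) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, byte[]> rec : shuffled) {
            result.add(rec.getKey() + "=" + decodeTuple(rec.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, byte[]>> input = new ArrayList<>();
        input.add(new SimpleEntry<String, byte[]>(
                "1", "(1,a)".getBytes(StandardCharsets.UTF_8)));
        List<Map.Entry<String, byte[]>> mapped = identityMap(input);
        System.out.println("decodes after map: " + decodeCalls);
        System.out.println(reduce(mapped));
        System.out.println("decodes after reduce: " + decodeCalls);
    }
}
```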

The reason for doing this is performance. Because tuples are read and written
by the identity map in byte[] form, we save one deserialization and one
serialization of each tuple in the mapper.

A use case is the following:

a = load 'f' as (id, v);
b = load 's' as (id, v);
c = join a by id, b by id;
d = group c by a::id;
dump d;

This example contains 2 MR jobs. After optimization, the first job outputs
(key, tuple) pairs, and the second job uses the identity map.


> PERFORMANCE: Use identity mapper in a chain of M-R jobs
> -------------------------------------------------------
>
>                 Key: PIG-480
>                 URL: https://issues.apache.org/jira/browse/PIG-480
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.2.0
>            Reporter: Olga Natkovich
>         Attachments: PIG_480.patch
>
>
> For jobs with two or more MR jobs, use identity mapper wherever possible in 
> second and subsequent MR jobs. Identity mapper is about 50% faster than a pig 
> empty map job because it doesn't parse the data. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
