[jira] Commented: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

Thejas M Nair (JIRA) Fri, 09 Jul 2010 11:52:46 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886820#action_12886820
 ]


Thejas M Nair commented on PIG-1472:
------------------------------------

The audit warning diff looks bogus. The contrib tests passed when I ran them on 
my machine; the failures seem to be caused by the Hudson environment.

The changes in PIG-1295 will need to be ported to work with this new 
serialization format. For that patch, I think we should introduce a new 
function in InterSedes that can compare two serialized tuples. Also add a 
function to BinSedesTuple that returns the corresponding InterSedes class. 
Then, while selecting the comparator, add a check to see whether the default 
tuple type is BinSedesTuple; if so, use the corresponding InterSedes function 
as the comparator class.
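
To illustrate what "compare two serialized tuples" could look like, here is a 
minimal, self-contained sketch. It is not the PIG-1295 or InterSedes code. The 
class name, the byte layout (a 4-byte field count followed by 4-byte int 
fields) and the ordering rule are all assumptions made for the example. The 
method signature only mirrors the (bytes, start, length) style commonly used 
for raw comparators.

```java
import java.nio.ByteBuffer;

public class SerializedTupleCompareSketch {

    // Hypothetical layout: [int fieldCount][int field0][int field1]...
    // The real BinSedesTuple format is different; this is for illustration only.
    static byte[] serialize(int... fields) {
        ByteBuffer buf = ByteBuffer.allocate(4 + 4 * fields.length);
        buf.putInt(fields.length);
        for (int f : fields) {
            buf.putInt(f);
        }
        return buf.array();
    }

    // Compares two tuples directly in their serialized form, without
    // building tuple objects first.
    // Assumed ordering: field by field, and if one tuple is a prefix of the
    // other, the shorter one sorts first.
    static int compareSerialized(byte[] b1, int s1, int l1,
                                 byte[] b2, int s2, int l2) {
        ByteBuffer a = ByteBuffer.wrap(b1, s1, l1);
        ByteBuffer b = ByteBuffer.wrap(b2, s2, l2);
        int n1 = a.getInt();
        int n2 = b.getInt();
        int common = Math.min(n1, n2);
        for (int i = 0; i < common; i++) {
            int c = Integer.compare(a.getInt(), b.getInt());
            if (c != 0) {
                return c;
            }
        }
        return Integer.compare(n1, n2);
    }

    // Convenience wrapper for whole arrays.
    static int compare(byte[] x, byte[] y) {
        return compareSerialized(x, 0, x.length, y, 0, y.length);
    }
}
```

The point of the sketch is only that the comparison walks the bytes and stops 
at the first differing field, so the cost of deserializing whole tuples during 
the sort is avoided.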


> Optimize serialization/deserialization between Map and Reduce and between MR 
> jobs
> ---------------------------------------------------------------------------------
>
>                 Key: PIG-1472
>                 URL: https://issues.apache.org/jira/browse/PIG-1472
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>         Attachments: PIG-1472.2.patch, PIG-1472.3.patch, PIG-1472.patch
>
>
> In certain types of Pig queries, most of the execution time is spent in 
> serializing/deserializing (sedes) records between Map and Reduce and between 
> MR jobs. 
> For example, if PigMix queries are modified to specify types for all the 
> fields in the load statement schema, some of the queries (L2, L3, L9, L10 in 
> PigMix v1) that have records with bags and maps being transmitted across map 
> or reduce boundaries run a lot longer (a runtime increase of a few times has 
> been seen).
> There are a few optimizations that have been shown to improve the 
> performance of sedes in my tests -
> 1. Use a smaller number of bytes to store the length of the column. For 
> example, if a bytearray is smaller than 255 bytes, a byte can be used to 
> store the length instead of the integer that is currently used.
> 2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and 
> DataInput.readUTF.  This reduces the cost of serialization by more than half. 
> Zebra and BinStorage are known to use DefaultTuple sedes functionality. The 
> serialization format that these loaders use cannot change, so after the 
> optimization their format is going to be different from the format used 
> between M/R boundaries.
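
The two optimizations listed in the issue description can be sketched in a few 
lines of plain Java. This is not the code from the attached patches. The class 
name, the type-tag values and the helper methods are hypothetical, and the 
sketch lets lengths up to and including 255 use the one-byte form. Only 
DataOutput.writeUTF, DataInput.readUTF and the other java.io calls are real 
API. Note that writeUTF uses a 2-byte length prefix with modified UTF-8 and 
rejects strings whose encoded form exceeds 65535 bytes, so longer strings 
would need a fallback path.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class CompactSedesSketch {

    // Hypothetical type tags; the tag tells the reader how wide the length is.
    static final byte TINY_BYTEARRAY = 1;   // length stored in 1 byte
    static final byte SMALL_BYTEARRAY = 2;  // length stored in 2 bytes
    static final byte BYTEARRAY = 3;        // length stored in 4 bytes

    // Optimization 1: pick the narrowest length field that fits.
    static void writeBytes(DataOutput out, byte[] value) throws IOException {
        int len = value.length;
        if (len <= 0xFF) {
            out.writeByte(TINY_BYTEARRAY);
            out.writeByte(len);
        } else if (len <= 0xFFFF) {
            out.writeByte(SMALL_BYTEARRAY);
            out.writeShort(len);
        } else {
            out.writeByte(BYTEARRAY);
            out.writeInt(len);
        }
        out.write(value);
    }

    static byte[] readBytes(DataInput in) throws IOException {
        byte tag = in.readByte();
        int len;
        switch (tag) {
            case TINY_BYTEARRAY:  len = in.readUnsignedByte();  break;
            case SMALL_BYTEARRAY: len = in.readUnsignedShort(); break;
            case BYTEARRAY:       len = in.readInt();           break;
            default: throw new IOException("unknown type tag " + tag);
        }
        byte[] value = new byte[len];
        in.readFully(value);
        return value;
    }

    // Optimization 2: rely on writeUTF/readUTF instead of custom string code.
    static byte[] serializeString(String s) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new DataOutputStream(bos).writeUTF(s);
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String deserializeString(byte[] data) {
        try {
            return new DataInputStream(new ByteArrayInputStream(data)).readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helpers that round-trip a bytearray through an in-memory stream.
    static byte[] serializeBytes(byte[] value) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            writeBytes(new DataOutputStream(bos), value);
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static byte[] deserializeBytes(byte[] data) {
        try {
            return readBytes(new DataInputStream(new ByteArrayInputStream(data)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this layout a 10-byte bytearray costs 2 bytes of overhead (tag plus 
one-byte length) instead of the 5 bytes a tag plus a 4-byte integer length 
would take, which adds up when records contain many small fields.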

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
