[ https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887265#action_12887265 ]
Daniel Dai commented on PIG-1472:
---------------------------------

Patch looks good. A couple of comments:

1. The following code is never used in BinStorage or InterStorage and should be removed.
{code}
public static final int RECORD_1 = 0x01;
public static final int RECORD_2 = 0x02;
public static final int RECORD_3 = 0x03;
{code}
2. In BinInterSedes, why do we have the type "GENERIC_WRITABLECOMPARABLE"? When will it be used?
3. InterStorage seems to be a replacement for BinStorage, so why do we make it private? Should we encourage users to use InterStorage in place of BinStorage and deprecate BinStorage?

> Optimize serialization/deserialization between Map and Reduce and between MR jobs
> ---------------------------------------------------------------------------------
>
>                 Key: PIG-1472
>                 URL: https://issues.apache.org/jira/browse/PIG-1472
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>         Attachments: PIG-1472.2.patch, PIG-1472.3.patch, PIG-1472.patch
>
>
> In certain types of Pig queries, most of the execution time is spent
> serializing/deserializing (sedes) records between Map and Reduce and between
> MR jobs.
> For example, if the PigMix queries are modified to specify types for all the
> fields in the load statement schema, some of the queries (L2, L3, L9, and L10
> in PigMix v1) that transmit records with bags and maps across map or reduce
> boundaries run a lot longer (a runtime increase of a few times has been seen).
> A few optimizations have been shown to improve sedes performance in my tests:
> 1. Use a smaller number of bytes to store the length of a column. For example,
> if a bytearray is smaller than 255 bytes, a byte can be used to store the
> length instead of the integer that is currently used.
> 2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and
> DataInput.readUTF. This reduces the cost of serialization by more than 1/2.
> Zebra and BinStorage are known to use the DefaultTuple sedes functionality.
> The serialization format that these loaders use cannot change, so after the
> optimization their format is going to be different from the format used
> between M/R boundaries.
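
For readers following the two optimizations listed in the issue description, here is a rough Java sketch of the idea. It is not the code from the attached patches: the class name `SedesSketch`, the type-marker constants, the thresholds, and the helper methods are all hypothetical and chosen for illustration only. Only `DataOutput.writeUTF` / `DataInput.readUTF` and the other `java.io` calls are real APIs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration; names and marker values are NOT taken from the patch.
class SedesSketch {
    static final byte TINY_BYTES = 1;   // length stored in one unsigned byte
    static final byte SMALL_BYTES = 2;  // length stored in an unsigned short
    static final byte BYTES = 3;        // length stored in a 4-byte int

    // Optimization 1: pick the narrowest length field that can hold data.length.
    static void writeBytes(DataOutput out, byte[] data) throws IOException {
        int len = data.length;
        if (len <= 0xFF) {
            out.writeByte(TINY_BYTES);
            out.writeByte(len);
        } else if (len <= 0xFFFF) {
            out.writeByte(SMALL_BYTES);
            out.writeShort(len);
        } else {
            out.writeByte(BYTES);
            out.writeInt(len);
        }
        out.write(data);
    }

    // The type marker tells the reader how wide the length field is.
    static byte[] readBytes(DataInput in) throws IOException {
        byte type = in.readByte();
        int len;
        switch (type) {
            case TINY_BYTES:  len = in.readUnsignedByte();  break;
            case SMALL_BYTES: len = in.readUnsignedShort(); break;
            case BYTES:       len = in.readInt();           break;
            default: throw new IOException("Unknown type marker: " + type);
        }
        byte[] data = new byte[len];
        in.readFully(data);
        return data;
    }

    // Serializes one byte-array field followed by one string field.
    static byte[] encode(byte[] data, String text) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            writeBytes(out, data);
            // Optimization 2: use the built-in modified-UTF-8 writer.
            // writeUTF throws UTFDataFormatException if the encoded string
            // exceeds 65535 bytes, so long strings would need another path.
            out.writeUTF(text);
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads back the leading byte-array field written by encode().
    static byte[] decodeBytes(byte[] encoded) {
        try {
            return readBytes(new DataInputStream(new ByteArrayInputStream(encoded)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encode(new byte[10], "").length);
        System.out.println(encode(new byte[300], "ab").length);
    }
}
```

Assuming a type byte is written per field in any case, the saving in this sketch is three bytes of length prefix for each small bytearray, which adds up when records carry many short fields in bags and maps.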