[ https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887441#action_12887441 ]
Thejas M Nair commented on PIG-1472:
------------------------------------

bq. 1. The following code are never used in BinStorage and InterStorage, should be removed.

I will remove that.

bq. 3. Seems InterStorage is a replacement for BinStorage, why do we make it private? Shall we encourage user use InterStorage in the place of BinStorage, and make BinStorage deprecate?

In the future, we are likely to find better ways to serialize data between the MR jobs of a pig query, i.e., the InterSedes serialization format is likely to change, and the new format is not likely to be compatible with the old one. So it will not be suitable for storing persistent data. InterStorage replaces BinStorage only for its use within pig. Since BinStorage is used in pig queries and its code should be easy to maintain, I think we don't have to deprecate BinStorage.

> Optimize serialization/deserialization between Map and Reduce and between MR jobs
> ---------------------------------------------------------------------------------
>
>                 Key: PIG-1472
>                 URL: https://issues.apache.org/jira/browse/PIG-1472
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>         Attachments: PIG-1472.2.patch, PIG-1472.3.patch, PIG-1472.patch
>
>
> In certain types of pig queries most of the execution time is spent in
> serializing/deserializing (sedes) records between Map and Reduce and between
> MR jobs.
> For example, if PigMix queries are modified to specify types for all the
> fields in the load statement schema, some of the queries (L2, L3, L9, L10 in
> pigmix v1) that have records with bags and maps being transmitted across map
> or reduce boundaries run a lot longer (a runtime increase of a few times has
> been seen).
> There are a few optimizations that have been shown to improve the performance
> of sedes in my tests -
> 1. Use a smaller number of bytes to store the length of the column. For
> example, if a bytearray is smaller than 255 bytes, a byte can be used to
> store the length instead of the integer that is currently used.
> 2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and
> DataInput.readUTF. This reduces the cost of serialization by more than 1/2.
> Zebra and BinStorage are known to use DefaultTuple sedes functionality. The
> serialization format that these loaders use cannot change, so after the
> optimization their format is going to be different from the format used
> between M/R boundaries.
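
For readers who want to see the two optimizations from the issue description in concrete form, here is a minimal sketch. It is not the code from the attached patches: the class name CompactSedesSketch, the type tags TINY_BYTEARRAY and BYTEARRAY, and the helper methods are all hypothetical names chosen for illustration. It assumes a one-byte type tag in front of each value, so the reader knows whether a one-byte or a four-byte length follows. Only the java.io calls (DataOutputStream.writeUTF, DataInputStream.readUTF and friends) are real APIs. Note that writeUTF stores a two-byte length and rejects strings whose encoded form exceeds 65535 bytes, so a real implementation would need a fallback for longer strings.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustrative sketch only; this is not Pig's InterSedes implementation.
final class CompactSedesSketch {
    // Hypothetical type tags; the tag tells the reader how wide the length field is.
    static final byte TINY_BYTEARRAY = 1; // length stored in one unsigned byte
    static final byte BYTEARRAY = 2;      // length stored in a four-byte int

    // Optimization 1: spend one byte on the length when the value is short.
    static void writeBytes(DataOutputStream out, byte[] data) throws IOException {
        if (data.length <= 255) {         // fits in an unsigned byte (0..255)
            out.writeByte(TINY_BYTEARRAY);
            out.writeByte(data.length);
        } else {
            out.writeByte(BYTEARRAY);
            out.writeInt(data.length);
        }
        out.write(data);
    }

    static byte[] readBytes(DataInputStream in) throws IOException {
        byte tag = in.readByte();
        int length;
        if (tag == TINY_BYTEARRAY) {
            length = in.readUnsignedByte();
        } else if (tag == BYTEARRAY) {
            length = in.readInt();
        } else {
            throw new IOException("Unknown type tag: " + tag);
        }
        byte[] data = new byte[length];
        in.readFully(data);
        return data;
    }

    // Convenience wrappers that serialize to and from an in-memory buffer.
    static byte[] encodeBytes(byte[] data) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            writeBytes(out, data);
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static byte[] decodeBytes(byte[] encoded) {
        try {
            return readBytes(new DataInputStream(new ByteArrayInputStream(encoded)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Optimization 2: let the JDK handle strings instead of custom code.
    static byte[] encodeString(String value) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeUTF(value);          // two-byte length, then modified UTF-8
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String decodeString(byte[] encoded) {
        try {
            return new DataInputStream(new ByteArrayInputStream(encoded)).readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this layout a 10-byte bytearray costs 12 bytes on the wire (tag, one-byte length, payload) rather than 15 with a four-byte length, which is where the saving comes from when a query moves many small fields across the map/reduce boundary. Because the tags and widths are private to the sketch, data written this way is only readable by the matching reader, which is the same reason the comment above argues against using the intermediate format for persistent storage.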