[ https://issues.apache.org/jira/browse/TEZ-972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rajesh Balamohan updated TEZ-972:
---------------------------------
    Attachment: TEZ-972-v3.patch

>> instead of using a byte[] array to store the initial partition list, I think
>> it's better to use a BitSet at that point.
- Fixed. Added lightweight functions toByteArray() and fromByteArray() in TezUtils.

>> Also, I think compression should just be enabled by default.
>> Compression doesn't seem to add too much overhead, and in most cases reduces
>> the data size. Relying on configuration adds overheads on users setting this
>> up correctly on the two ends of the Edge (different configuration instances
>> can be used for both).
- Compression adds overhead in corner cases. E.g., if the BitSet size is 1, the compressed size would be 9 instead of 1. This was the reason for including it in conf. But given the config overhead you mentioned, I have removed the config-related changes.

> Shuffle Phase - optimize memory usage of empty partition data in DataMovementEvent
> ----------------------------------------------------------------------------------
>
>                 Key: TEZ-972
>                 URL: https://issues.apache.org/jira/browse/TEZ-972
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>         Attachments: TEZ-972-v1.patch, TEZ-972-v2.patch, TEZ-972-v3.patch
>
>
> Empty partition details are stored in a byte[] in compressed format and sent
> via DataMovementEvent in the shuffle phase. Quick standalone tests reveal that
> a BitSet would be more efficient than compressing the byte[].
>
> PartitionSize=1 , BitSetSize=1 , CompressedBitSetSize=9 , NormalByteArrayCompressed=9
> PartitionSize=101 , BitSetSize=13 , CompressedBitSetSize=22 , NormalByteArrayCompressed=42
> PartitionSize=201 , BitSetSize=26 , CompressedBitSetSize=37 , NormalByteArrayCompressed=62
> PartitionSize=301 , BitSetSize=38 , CompressedBitSetSize=49 , NormalByteArrayCompressed=76
> ..
> PartitionSize=1001 , BitSetSize=126 , CompressedBitSetSize=137 , NormalByteArrayCompressed=197
> ..
> PartitionSize=2001 , BitSetSize=251 , CompressedBitSetSize=262 , NormalByteArrayCompressed=374
> PartitionSize=4001 , BitSetSize=501 , CompressedBitSetSize=512 , NormalByteArrayCompressed=686
> PartitionSize=8001 , BitSetSize=1001 , CompressedBitSetSize=1012 , NormalByteArrayCompressed=1330
> PartitionSize=16001 , BitSetSize=2001 , CompressedBitSetSize=1979 , NormalByteArrayCompressed=2569
> PartitionSize=32001 , BitSetSize=4001 , CompressedBitSetSize=3885 , NormalByteArrayCompressed=5000
>
> - This is based on treating random bit positions as empty partitions.
>
> It is not possible to use JDK 1.6's BitSet directly, as it does not support
> the valueOf() and toByteArray() functions. The suggestion is to have a
> Tez-specific BitSet (until Tez moves to JDK 1.7) and make compression a job
> configuration.
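
For illustration only, here is a minimal sketch of what toByteArray()/fromByteArray() helpers for a JDK 1.6 BitSet could look like. This is not the code from TEZ-972-v3.patch: the class name BitSetBytes is hypothetical, and the bit layout (bit i stored in byte i/8 at position i%8) is an assumption chosen to match what JDK 1.7's BitSet.toByteArray() documents.

```java
import java.util.BitSet;

// Hypothetical helper class; the actual TezUtils code may differ.
public final class BitSetBytes {
    private BitSetBytes() {}

    // Packs the set bits into a byte array using only JDK 1.6 BitSet methods.
    // Assumed layout: bit i is stored in byte i/8 at bit position i%8.
    public static byte[] toByteArray(BitSet bits) {
        if (bits == null) {
            return new byte[0];
        }
        // bits.length() is the index of the highest set bit plus one.
        byte[] bytes = new byte[(bits.length() + 7) / 8];
        for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i + 1)) {
            bytes[i / 8] |= (byte) (1 << (i % 8));
        }
        return bytes;
    }

    // Rebuilds a BitSet from a byte array produced by toByteArray().
    public static BitSet fromByteArray(byte[] bytes) {
        BitSet bits = new BitSet();
        if (bytes == null) {
            return bits;
        }
        for (int i = 0; i < bytes.length * 8; i++) {
            if ((bytes[i / 8] & (1 << (i % 8))) != 0) {
                bits.set(i);
            }
        }
        return bits;
    }
}
```

With this layout, a BitSet whose highest set bit is index 1000 (1001 partitions) packs into 126 bytes, consistent with the BitSetSize figure reported for PartitionSize=1001.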