[ https://issues.apache.org/jira/browse/TEZ-972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13945808#comment-13945808 ]
Siddharth Seth commented on TEZ-972:
------------------------------------

[~rajesh.balamohan] - instead of using a byte[] array to store the initial
partition list, I think it's better to use a BitSet at that point. After this,
the BitSet can be converted into a byte array by walking the set bits. This
should be no worse than walking through the entire list of bytes. Thoughts?

Also, I think compression should just be enabled by default. Alternately, a
boolean should be used in the payload to indicate whether compression has been
used, instead of relying on the configuration. Compression doesn't seem to add
much overhead, and in most cases it reduces the data size. Relying on
configuration puts the burden on users to set this up consistently on both ends
of the Edge (different configuration instances can be used for the two ends).

Currently, the test accepts a parameter for compression but doesn't use it.
Also, could you please rename the private method to something other than test*.
Normally, the main tests start with test* - which makes them easier to find.

> Shuffle Phase - optimize memory usage of empty partition data in DataMovementEvent
> ----------------------------------------------------------------------------------
>
>                 Key: TEZ-972
>                 URL: https://issues.apache.org/jira/browse/TEZ-972
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>         Attachments: TEZ-972-v1.patch, TEZ-972-v2.patch
>
>
> Empty partition details are stored in byte[] in compressed format and sent
> via DataMovementEvent in shuffle phase. Quick standalone tests reveal that
> BitSet would be more efficient than compressing the byte[].
> PartitionSize=1 , BitSetSize=1 , CompressedBitSetSize=9 , NormalByteArrayCompressed=9
> PartitionSize=101 , BitSetSize=13 , CompressedBitSetSize=22 , NormalByteArrayCompressed=42
> PartitionSize=201 , BitSetSize=26 , CompressedBitSetSize=37 , NormalByteArrayCompressed=62
> PartitionSize=301 , BitSetSize=38 , CompressedBitSetSize=49 , NormalByteArrayCompressed=76
> ..
> PartitionSize=1001 , BitSetSize=126 , CompressedBitSetSize=137 , NormalByteArrayCompressed=197
> ..
> PartitionSize=2001 , BitSetSize=251 , CompressedBitSetSize=262 , NormalByteArrayCompressed=374
> PartitionSize=4001 , BitSetSize=501 , CompressedBitSetSize=512 , NormalByteArrayCompressed=686
> PartitionSize=8001 , BitSetSize=1001 , CompressedBitSetSize=1012 , NormalByteArrayCompressed=1330
> PartitionSize=16001 , BitSetSize=2001 , CompressedBitSetSize=1979 , NormalByteArrayCompressed=2569
> PartitionSize=32001 , BitSetSize=4001 , CompressedBitSetSize=3885 , NormalByteArrayCompressed=5000
>
> - This is based on treating random bit positions as empty partitions.
>
> It is not possible to use JDK 1.6's BitSet directly, as it does not support
> the valueOf() and toByteArray() functions. The suggestion is to have a
> Tez-specific BitSet (until Tez moves to JDK 1.7) and to make compression a
> job configuration.
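
To make the two suggestions in the comment concrete, here is a minimal sketch of packing a BitSet into bytes by walking only the set bits (so it does not need JDK 1.7's toByteArray()), with a leading flag byte in the payload recording whether compression was applied. This is not the code from the attached patches: the class name EmptyPartitionCodec, the methods encode/decode, and the one-byte flag layout are all hypothetical, chosen only for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.util.BitSet;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper, not part of Tez. Assumed payload layout:
//   byte 0     : 1 if the remaining bytes are deflate-compressed, else 0
//   bytes 1..n : packed bits, bit i stored in byte i/8 at position i%8
final class EmptyPartitionCodec {

    // Pack the BitSet by visiting only the set bits (works on JDK 1.6).
    static byte[] toBytes(BitSet bits) {
        byte[] out = new byte[(bits.length() + 7) / 8];
        for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i + 1)) {
            out[i / 8] |= (byte) (1 << (i % 8));
        }
        return out;
    }

    // Rebuild a BitSet from packed bytes starting at the given offset.
    static BitSet fromBytes(byte[] bytes, int offset) {
        BitSet bits = new BitSet();
        for (int i = offset; i < bytes.length; i++) {
            for (int b = 0; b < 8; b++) {
                if ((bytes[i] & (1 << b)) != 0) {
                    bits.set((i - offset) * 8 + b);
                }
            }
        }
        return bits;
    }

    // Encode with a flag byte so the reader never consults configuration.
    static byte[] encode(BitSet emptyPartitions, boolean compress) {
        byte[] raw = toBytes(emptyPartitions);
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        bos.write(compress ? 1 : 0);
        if (!compress) {
            bos.write(raw, 0, raw.length);
            return bos.toByteArray();
        }
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            bos.write(buf, 0, n);
        }
        deflater.end();
        return bos.toByteArray();
    }

    // Decode by reading the flag byte written by encode().
    static BitSet decode(byte[] payload) {
        if (payload.length == 0) {
            throw new IllegalArgumentException("empty payload");
        }
        if (payload[0] == 0) {
            return fromBytes(payload, 1);
        }
        Inflater inflater = new Inflater();
        inflater.setInput(payload, 1, payload.length - 1);
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && !inflater.finished()) {
                    throw new IllegalArgumentException("truncated payload");
                }
                bos.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt payload", e);
        } finally {
            inflater.end();
        }
        return fromBytes(bos.toByteArray(), 0);
    }
}
```

With this layout the producer can decide per event whether to compress, for example calling EmptyPartitionCodec.encode(bits, true) only when the partition count is large, and the consumer calls EmptyPartitionCodec.decode(payload) without needing a matching configuration value on its side of the Edge.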