[ 
https://issues.apache.org/jira/browse/TEZ-972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13945808#comment-13945808
 ] 

Siddharth Seth commented on TEZ-972:
------------------------------------

[~rajesh.balamohan] - instead of using a byte[] to store the initial 
partition list, I think it's better to use a BitSet at that point. The BitSet 
can then be converted into a byte array by walking its set bits, which should 
be no worse than walking through the entire list of bytes. Thoughts?
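
A minimal sketch of the conversion suggested above, assuming a little-endian 
bit layout (bit i lands in byte i / 8, at bit position i % 8). The class and 
method names are hypothetical and not taken from the patch; only BitSet methods 
that already exist in JDK 1.6 are used.

```java
import java.util.BitSet;

// Hypothetical helper, not part of Tez: packs a BitSet of empty partitions
// into bytes using only methods available on JDK 1.6.
final class EmptyPartitionBits {
    private EmptyPartitionBits() {}

    // Walks only the set bits; bit i is stored in byte i / 8 at position i % 8.
    static byte[] toByteArray(BitSet bits) {
        byte[] out = new byte[(bits.length() + 7) / 8];
        for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i + 1)) {
            out[i / 8] |= (byte) (1 << (i % 8));
        }
        return out;
    }

    // Inverse of toByteArray, for the consuming side of the Edge.
    static BitSet fromByteArray(byte[] bytes) {
        BitSet bits = new BitSet(bytes.length * 8);
        for (int i = 0; i < bytes.length * 8; i++) {
            if ((bytes[i / 8] & (1 << (i % 8))) != 0) {
                bits.set(i);
            }
        }
        return bits;
    }
}
```

The output length depends on the highest set bit, so a set whose highest bit 
is 100 packs into 13 bytes.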

Also, I think compression should just be enabled by default. Alternatively, a 
boolean should be used in the payload to indicate whether compression has been 
used, instead of relying on the configuration. Compression doesn't seem to add 
too much overhead, and in most cases it reduces the data size. Relying on 
configuration puts the burden on users to set this up correctly on both ends 
of the Edge (different configuration instances can be used for each end).
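
To illustrate the flag idea, here is a rough sketch in which the producer 
records whether it compressed the data and the consumer reads that marker 
instead of its own configuration. The one-byte prefix and the FlaggedPayload 
class are assumptions made for this example; the actual payload format in the 
patch may carry the boolean differently.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical framing, not the Tez payload format: the first byte says
// whether the remaining bytes are deflated.
final class FlaggedPayload {
    private static final byte RAW = 0;
    private static final byte DEFLATED = 1;

    private FlaggedPayload() {}

    static byte[] encode(byte[] data, boolean compress) {
        byte[] body = compress ? deflate(data) : data;
        byte[] out = new byte[body.length + 1];
        out[0] = compress ? DEFLATED : RAW;
        System.arraycopy(body, 0, out, 1, body.length);
        return out;
    }

    // The consumer needs no configuration: the flag byte decides the path.
    static byte[] decode(byte[] payload) {
        if (payload[0] == RAW) {
            return Arrays.copyOfRange(payload, 1, payload.length);
        }
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(payload, 1, payload.length - 1);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[256];
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated payload");
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt payload", e);
        } finally {
            inflater.end();
        }
    }

    private static byte[] deflate(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }
}
```

With this shape, a mismatch between the configuration instances on the two 
ends of the Edge cannot cause the consumer to misread the data.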

Currently, the test accepts a parameter for compression but doesn't use it. 
Also, could you please rename the private method to something other than 
test*? Normally, only the main tests start with test*, which makes them easier 
to find.

> Shuffle Phase - optimize memory usage of empty partition data in 
> DataMovementEvent
> ----------------------------------------------------------------------------------
>
>                 Key: TEZ-972
>                 URL: https://issues.apache.org/jira/browse/TEZ-972
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>         Attachments: TEZ-972-v1.patch, TEZ-972-v2.patch
>
>
> Empty partition details are stored in a byte[] in compressed format and sent 
> via DataMovementEvent in the shuffle phase.  Quick standalone tests reveal 
> that a BitSet would be more efficient than compressing the byte[].  
> PartitionSize  BitSetSize  CompressedBitSetSize  NormalByteArrayCompressed
>             1           1                     9                          9
>           101          13                    22                         42
>           201          26                    37                         62
>           301          38                    49                         76
>            ..
>          1001         126                   137                        197
>            ..
>          2001         251                   262                        374
>          4001         501                   512                        686
>          8001        1001                  1012                       1330
>         16001        2001                  1979                       2569
>         32001        4001                  3885                       5000
> This is based on treating random bit positions as empty partitions.
> It is not possible to use JDK 1.6's BitSet directly, as it does not support 
> the valueOf() and toByteArray() methods.  The suggestion is to have a 
> Tez-specific BitSet (until Tez moves to JDK 1.7) and to make compression a 
> job configuration option.



