[ 
https://issues.apache.org/jira/browse/TEZ-972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated TEZ-972:
---------------------------------

    Attachment: TEZ-972-v3.patch

>> instead of using a byte[] array to store the initial partition list, I think 
>> it's better to use a BitSet at that point. 
- Fixed. Added lightweight helper functions toByteArray() and fromByteArray() 
in TezUtils.
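
For readers without the patch at hand, here is a minimal sketch of how such helpers could look on JDK 1.6, which lacks BitSet.valueOf() and BitSet.toByteArray(). This is an illustration only, not the code in TEZ-972-v3.patch: the class name BitSetBytes and the little-endian bit layout are assumptions.

```java
import java.util.BitSet;

// Hypothetical helper class; the name and bit layout are assumptions,
// not taken from the attached patch.
final class BitSetBytes {

    // Packs the set bits into a byte array: bit i is stored in
    // byte i / 8 at bit position i % 8 (least significant bit first).
    static byte[] toByteArray(BitSet bits) {
        if (bits == null) {
            return new byte[0];
        }
        // bits.length() is the index of the highest set bit plus one.
        byte[] bytes = new byte[(bits.length() + 7) / 8];
        for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i + 1)) {
            bytes[i / 8] |= (byte) (1 << (i % 8));
        }
        return bytes;
    }

    // Rebuilds a BitSet from a byte array produced by toByteArray().
    static BitSet fromByteArray(byte[] bytes) {
        BitSet bits = new BitSet();
        if (bytes == null) {
            return bits;
        }
        for (int i = 0; i < bytes.length * 8; i++) {
            if ((bytes[i / 8] & (1 << (i % 8))) != 0) {
                bits.set(i);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        BitSet empty = new BitSet();
        empty.set(3);    // partition 3 is empty
        empty.set(100);  // partition 100 is empty
        byte[] encoded = toByteArray(empty);
        System.out.println("bytes=" + encoded.length
            + " roundTrip=" + fromByteArray(encoded).equals(empty));
    }
}
```

With this layout the array length depends only on the highest set bit, so trailing non-empty partitions cost nothing.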

>> Also, I think compression should just be enabled by default. 
>> Compression doesn't seem to add too much overhead, and in most cases reduces 
>> the data size. Relying on configuration adds overheads on users setting this 
>> up correctly on the two ends of the Edge (different configuration instances 
>> can be used for both).

- Compression adds overhead in corner cases.  E.g., if the BitSet size is 1, 
the compressed size would be 9 instead of 1.  This was the reason for making 
it configurable.  But given the configuration overhead you mentioned, I have 
removed the config-related changes.
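
The corner case above can be reproduced with a small standalone sketch. It assumes java.util.zip.Deflater with default settings as the compressor; the thread does not say which codec the patch uses, and the class name CompressionOverhead is made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

// Hypothetical demo class; Deflater is an assumed stand-in for whatever
// compression the patch actually applies.
final class CompressionOverhead {

    // Compresses the input with a default Deflater and returns the result.
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // A one-byte payload: stream header and checksum dominate the output.
        byte[] tiny = new byte[] {1};
        // A larger, highly repetitive payload, where compression pays off.
        byte[] large = new byte[4000];
        System.out.println("tiny:  raw=" + tiny.length
            + " compressed=" + compress(tiny).length);
        System.out.println("large: raw=" + large.length
            + " compressed=" + compress(large).length);
    }
}
```

The fixed per-stream cost is why very small bitsets grow when compressed, while larger or more regular ones shrink.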


> Shuffle Phase - optimize memory usage of empty partition data in 
> DataMovementEvent
> ----------------------------------------------------------------------------------
>
>                 Key: TEZ-972
>                 URL: https://issues.apache.org/jira/browse/TEZ-972
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>         Attachments: TEZ-972-v1.patch, TEZ-972-v2.patch, TEZ-972-v3.patch
>
>
> Empty partition details are stored in a byte[] in compressed format and sent 
> via DataMovementEvent in the shuffle phase.  Quick standalone tests reveal 
> that a BitSet would be more efficient than compressing the byte[].  
> PartitionSize=1, BitSetSize=1, CompressedBitSetSize=9, NormalByteArrayCompressed=9
> PartitionSize=101, BitSetSize=13, CompressedBitSetSize=22, NormalByteArrayCompressed=42
> PartitionSize=201, BitSetSize=26, CompressedBitSetSize=37, NormalByteArrayCompressed=62
> PartitionSize=301, BitSetSize=38, CompressedBitSetSize=49, NormalByteArrayCompressed=76
> ..
> PartitionSize=1001, BitSetSize=126, CompressedBitSetSize=137, NormalByteArrayCompressed=197
> ..
> PartitionSize=2001, BitSetSize=251, CompressedBitSetSize=262, NormalByteArrayCompressed=374
> PartitionSize=4001, BitSetSize=501, CompressedBitSetSize=512, NormalByteArrayCompressed=686
> PartitionSize=8001, BitSetSize=1001, CompressedBitSetSize=1012, NormalByteArrayCompressed=1330
> PartitionSize=16001, BitSetSize=2001, CompressedBitSetSize=1979, NormalByteArrayCompressed=2569
> PartitionSize=32001, BitSetSize=4001, CompressedBitSetSize=3885, NormalByteArrayCompressed=5000
> - This is based on treating random bit positions as empty partitions.
> It is not possible to use JDK 1.6's BitSet directly, as it does not support 
> the valueOf() and toByteArray() functions.  The suggestion is to have a 
> Tez-specific BitSet (until Tez moves to JDK 1.7) and to make compression a 
> job configuration option.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
