[ 
https://issues.apache.org/jira/browse/TEZ-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071658#comment-14071658
 ] 

Rajesh Balamohan commented on TEZ-1288:
---------------------------------------

Ran a benchmark (Hive query at 200 GB scale).

Without patch:
Total KV pairs written to IFile in the job: 910,798,869
IFile Raw Len in bytes: 28,642,808,103
Job runtime: 417.3 seconds

With patch:
Total KV pairs written to IFile in the job: 910,798,869
IFile Raw Len in bytes: 21,356,783,865
Job runtime: 398 seconds

- With the patch, IFile raw length drops by about 25%. This should also 
contribute directly to better usage of the sort buffer.
- With the patch, job runtime improves by about 5%. (If defaultCodec is used, 
the improvement is around 8%.)
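
For reference, the percentages can be re-derived from the figures quoted above. The sketch below is only arithmetic on those numbers; the class and method names are made up for illustration and are not part of the patch. It also prints the bytes saved per KV pair, which comes out close to 8. That would be consistent with dropping a 4-byte prefix from both key and value, though the benchmark itself does not break the saving down that way.

```java
final class IFileBenchmarkDelta {
    // Figures copied from the benchmark results in this comment.
    static final long KV_PAIRS = 910_798_869L;
    static final long RAW_LEN_WITHOUT = 28_642_808_103L;
    static final long RAW_LEN_WITH = 21_356_783_865L;
    static final double RUNTIME_WITHOUT_SEC = 417.3;
    static final double RUNTIME_WITH_SEC = 398.0;

    /** Relative reduction from before to after, as a percentage of before. */
    static double percentReduction(double before, double after) {
        return 100.0 * (before - after) / before;
    }

    /** Average number of raw bytes saved per key/value pair. */
    static double bytesSavedPerPair(long before, long after, long pairs) {
        return (double) (before - after) / pairs;
    }

    public static void main(String[] args) {
        System.out.printf("raw length reduction: %.1f%%%n",
                percentReduction(RAW_LEN_WITHOUT, RAW_LEN_WITH));
        System.out.printf("runtime reduction:    %.1f%%%n",
                percentReduction(RUNTIME_WITHOUT_SEC, RUNTIME_WITH_SEC));
        System.out.printf("bytes saved per pair: %.2f%n",
                bytesSavedPerPair(RAW_LEN_WITHOUT, RAW_LEN_WITH, KV_PAIRS));
    }
}
```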




> Create FastTezSerialization as an optional feature
> --------------------------------------------------
>
>                 Key: TEZ-1288
>                 URL: https://issues.apache.org/jira/browse/TEZ-1288
>             Project: Apache Tez
>          Issue Type: Improvement
>    Affects Versions: 0.5.0
>            Reporter: Gopal V
>            Assignee: Rajesh Balamohan
>         Attachments: TEZ-1288.1.patch, TEZ-1288.2.patch
>
>
> Tez inherits the writable framework from map-reduce. 
> This is very flexible, but not particularly memory efficient for the small 
> data types.
> When deserializing, each value and key has to be allocated afresh for each 
> small chunk of data (new IntWritable instead of .set()).
> The bytes writable serialization operation always has to write a 4 byte 
> prefix for all values and keys, because of requirements around streamed 
> .readFields() instead of a custom setter/getter impl.
> Implement a faster serialization mechanism for the inner loop of sort, spill, 
> merge, which doesn't trigger the GC and avoids adding simplistic overheads to 
> the IFile format.
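
The two costs named in the description (a 4-byte length prefix per value, and a fresh object per record) can be illustrated with a minimal sketch. It uses only java.nio and is not the Hadoop Writable API or the code in the attached patches; IntHolder and the method names are hypothetical stand-ins chosen for this example.

```java
import java.nio.ByteBuffer;

final class PrefixOverheadSketch {
    /** Hypothetical reusable holder, standing in for an object updated via set(). */
    static final class IntHolder {
        private int value;
        void set(int v) { value = v; }
        int get() { return value; }
    }

    /** Length-prefixed layout: a 4-byte length, then the payload bytes. */
    static byte[] writeLengthPrefixed(byte[] payload) {
        return ByteBuffer.allocate(4 + payload.length)
                .putInt(payload.length)
                .put(payload)
                .array();
    }

    /** Fixed-width layout: the reader already knows each int is 4 bytes, so no prefix. */
    static byte[] writeFixedWidthInts(int[] values) {
        ByteBuffer buf = ByteBuffer.allocate(4 * values.length);
        for (int v : values) {
            buf.putInt(v);
        }
        return buf.array();
    }

    /** Decodes fixed-width ints into one reused holder instead of allocating per record. */
    static long sumWithReuse(byte[] data, IntHolder reuse) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        long sum = 0;
        while (buf.remaining() >= 4) {
            reuse.set(buf.getInt()); // overwrite in place, no new object
            sum += reuse.get();
        }
        return sum;
    }

    public static void main(String[] args) {
        byte[] prefixed = writeLengthPrefixed(new byte[4]); // one 4-byte value
        byte[] fixed = writeFixedWidthInts(new int[] {7});  // the same amount of payload
        System.out.println("length-prefixed bytes: " + prefixed.length);
        System.out.println("fixed-width bytes:     " + fixed.length);
        System.out.println("sum with reuse:        "
                + sumWithReuse(writeFixedWidthInts(new int[] {1, 2, 3}), new IntHolder()));
    }
}
```

For a 4-byte payload the prefix doubles the encoded size, which is why the overhead matters most for small data types.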



--
This message was sent by Atlassian JIRA
(v6.2#6252)
