[ https://issues.apache.org/jira/browse/TEZ-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071658#comment-14071658 ]
Rajesh Balamohan commented on TEZ-1288:
---------------------------------------

Ran a benchmark (Hive query at 200 GB scale).

Without patch:
Total KV pairs written to IFile in the job: 910,798,869
IFile Raw Len in bytes: 28,642,808,103
Job runtime: 417.3 seconds

With patch:
Total KV pairs written to IFile in the job: 910,798,869
IFile Raw Len in bytes: 21,356,783,865
Job runtime: 398 seconds

- With the patch, raw length improves by about 25%. This should directly contribute to better usage of the sort buffer as well.
- With the patch, job runtime improves by about 5%. (If defaultCodec is used, the improvement is around 8%.)

> Create FastTezSerialization as an optional feature
> --------------------------------------------------
>
>                 Key: TEZ-1288
>                 URL: https://issues.apache.org/jira/browse/TEZ-1288
>             Project: Apache Tez
>          Issue Type: Improvement
>    Affects Versions: 0.5.0
>            Reporter: Gopal V
>            Assignee: Rajesh Balamohan
>         Attachments: TEZ-1288.1.patch, TEZ-1288.2.patch
>
>
> Tez inherits the writable framework from map-reduce.
> This is very flexible, but not particularly memory efficient for the small
> data types.
> When deserializing, each value and key has to be allocated afresh for each
> small chunk of data (new IntWritable instead of .set()).
> The bytes writable serialization operation always has to write a 4 byte
> prefix for all values and keys, because of requirements around streamed
> .readFields() instead of a custom setter/getter impl.
> Implement a faster serialization mechanism for the inner loop of sort, spill,
> merge, which doesn't trigger the GC and avoids adding simplistic overheads to
> the IFile format.
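
As a rough cross-check of the benchmark figures in the comment above, the sketch below recomputes the reductions from the raw numbers. The class and method names are hypothetical and not part of Tez; only the input constants come from the comment. The per-pair figure is an inference from those numbers, not something the benchmark reports directly.

```java
// Back-of-the-envelope check of the benchmark figures quoted in the comment.
// IFileSavingsSketch and its methods are hypothetical names for illustration;
// only the constants below are taken from the reported results.
final class IFileSavingsSketch {
    static final long KV_PAIRS = 910_798_869L;
    static final long RAW_LEN_WITHOUT_PATCH = 28_642_808_103L;
    static final long RAW_LEN_WITH_PATCH = 21_356_783_865L;
    static final double RUNTIME_WITHOUT_PATCH_SEC = 417.3;
    static final double RUNTIME_WITH_PATCH_SEC = 398.0;

    // Absolute reduction in IFile raw length, in bytes.
    static long savedBytes() {
        return RAW_LEN_WITHOUT_PATCH - RAW_LEN_WITH_PATCH;
    }

    // Reduction in raw length as a percentage of the unpatched size.
    static double rawLenReductionPercent() {
        return 100.0 * savedBytes() / RAW_LEN_WITHOUT_PATCH;
    }

    // Average number of bytes saved per key-value pair.
    static double savedBytesPerPair() {
        return (double) savedBytes() / KV_PAIRS;
    }

    // Reduction in job runtime as a percentage of the unpatched runtime.
    static double runtimeReductionPercent() {
        return 100.0 * (RUNTIME_WITHOUT_PATCH_SEC - RUNTIME_WITH_PATCH_SEC)
                / RUNTIME_WITHOUT_PATCH_SEC;
    }

    public static void main(String[] args) {
        System.out.printf("Saved bytes: %,d%n", savedBytes());
        System.out.printf("Raw length reduction: %.1f%%%n", rawLenReductionPercent());
        System.out.printf("Saved bytes per KV pair: %.2f%n", savedBytesPerPair());
        System.out.printf("Runtime reduction: %.1f%%%n", runtimeReductionPercent());
    }
}
```

The saving works out to roughly 8 bytes per key-value pair, which would be consistent with dropping a 4-byte prefix from each key and each value as described in the issue, although the benchmark itself does not break the saving down that way.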