[ https://issues.apache.org/jira/browse/TEZ-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054483#comment-14054483 ]
Siddharth Seth commented on TEZ-1228:
-------------------------------------

bq. 1) Indexed file format is more work than currently estimated and has very little value unless the index is going to be used eventually.
It's definitely a lot more work. Just wondering how much of a benefit it could provide eventually (once it is used by Pig/Hive). I'd guess skipping large chunks would be fairly useful.

bq. Fixed key length is a scenario that is less popular than I imagined - NULL is almost always a 1-byte value whether it is for Int/Float/Decimal in Hive.
The sorter itself could use the fixed-length nature to be more efficient on memory. In any case, these would be separate jiras - nothing to be done here.

> Prototype IFile : Define a memory & merge optimized vertex-intermediate file format for Tez
> -------------------------------------------------------------------------------------------
>
>                 Key: TEZ-1228
>                 URL: https://issues.apache.org/jira/browse/TEZ-1228
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>              Labels: perfomance
>         Attachments: TEZ-1228-IFile.pdf, TEZ-1228.1.patch, TEZ-1228.2.patch, TEZ-1228.WIP.1.patch, TEZ-1228.WIP.2.patch
>
>
> The current vertex-intermediate format used all across Tez is a flat file of
> variable-length k,v pairs. For a significant number of use-cases, in
> particular the sorted output phase, a large number of consecutive identical
> keys are found within the same stream. The IFile format ends up writing each
> key out fully into the stream to generate (K,V) pairs instead of ordering it
> into a more efficient K, {V1, .. Vn} list.
> This duplication of key data needs larger buffers to hold in memory and
> requires comparison between keys known to be identical while doing a merge
> sort.
> This bug tracks the building of a prototype IFile format which is optimized
> for lower uncompressed sizes within memory buffers and for less
> compute-intensive merge sorts during the reducer phase.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
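
To make the layout difference in the issue description concrete, here is a minimal sketch of collapsing a sorted stream of (K,V) pairs into K, {V1, .. Vn} runs. It is only an illustration and not the code in the attached patches: the class `GroupedRunSketch`, its `Run` type, and the helper methods are hypothetical names, and keys and values are plain strings rather than serialized bytes.

```java
import java.util.ArrayList;
import java.util.List;

public class GroupedRunSketch {

    /** One key plus every value that followed it consecutively in the sorted stream. */
    public static final class Run {
        public final String key;
        public final List<String> values = new ArrayList<>();

        Run(String key) {
            this.key = key;
        }
    }

    /** Collapses sorted {key, value} pairs into K, {V1, .. Vn} runs. */
    public static List<Run> group(List<String[]> sortedPairs) {
        List<Run> runs = new ArrayList<>();
        Run current = null;
        for (String[] kv : sortedPairs) {
            // Start a new run only when the key changes, so a repeated key is stored once.
            if (current == null || !current.key.equals(kv[0])) {
                current = new Run(kv[0]);
                runs.add(current);
            }
            current.values.add(kv[1]);
        }
        return runs;
    }

    /** Key characters stored when every pair carries its own copy of the key. */
    public static int flatKeyChars(List<String[]> pairs) {
        int total = 0;
        for (String[] kv : pairs) {
            total += kv[0].length();
        }
        return total;
    }

    /** Key characters stored when each run carries its key exactly once. */
    public static int groupedKeyChars(List<Run> runs) {
        int total = 0;
        for (Run r : runs) {
            total += r.key.length();
        }
        return total;
    }
}
```

For the sorted input (apple,1), (apple,2), (apple,3), (pear,4), the flat layout stores 19 key characters and the grouped layout stores 9. A merge over grouped runs would also need only one key comparison per run instead of one per pair, which is the compute saving the description refers to.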