Rajesh Balamohan created TEZ-1228:
-------------------------------------

             Summary: Prototype IFile : Define a memory & merge optimized 
vertex-intermediate file format for Tez
                 Key: TEZ-1228
                 URL: https://issues.apache.org/jira/browse/TEZ-1228
             Project: Apache Tez
          Issue Type: Bug
            Reporter: Rajesh Balamohan


The current vertex-intermediate format used throughout Tez is a flat file of 
variable-length (K,V) pairs. In a significant number of use cases, in particular 
the sorted output phase, a large number of consecutive identical keys are 
found within the same stream. The IFile format writes each key out in full 
for every (K,V) pair instead of organizing the stream into a more efficient 
K, {V1, .. Vn} list.
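
As a rough illustration of the difference between the two layouts, the sketch 
below collapses runs of consecutive identical keys in a sorted stream into 
K, {V1, .. Vn} groups and compares the raw payload sizes. It is written in 
Python for brevity and is not Tez code: the function names are hypothetical, 
and the size estimate ignores the length prefixes and end-of-run markers a 
real on-disk format would need.

```python
from itertools import groupby
from operator import itemgetter


def group_sorted_pairs(pairs):
    """Collapse runs of consecutive identical keys into (key, [values])."""
    return [
        (key, [value for _, value in run])
        for key, run in groupby(pairs, key=itemgetter(0))
    ]


def flat_size(pairs):
    """Payload bytes when every pair carries its own copy of the key."""
    return sum(len(key) + len(value) for key, value in pairs)


def grouped_size(groups):
    """Payload bytes when each key is stored once per run."""
    return sum(len(key) + sum(len(v) for v in values) for key, values in groups)


# Hypothetical sorted output: three values share the key b"apple".
pairs = [(b"apple", b"1"), (b"apple", b"2"), (b"apple", b"3"), (b"pear", b"4")]
groups = group_sorted_pairs(pairs)

print(groups)
print(flat_size(pairs), grouped_size(groups))
```

In this toy input the flat layout carries 23 payload bytes and the grouped 
layout 13. The actual saving depends on key length and on how long the runs 
of identical keys are.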

This duplication of key data requires larger buffers to hold the stream in 
memory and forces comparisons between keys already known to be identical 
during a merge sort.
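
To show why a grouped layout could reduce merge work, here is a minimal 
sketch of merging already-sorted segments whose entries are (key, [values]) 
runs. The merge compares one key per run rather than one key per pair. The 
name merge_grouped and the in-memory list representation are assumptions 
made for illustration in Python; this is not the Tez merger.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_grouped(*segments):
    """Merge sorted segments of (key, [values]) runs into one grouped stream.

    heapq.merge compares one key per run, so a run of n identical keys costs
    a single heap entry instead of n. Runs with equal keys coming from
    different segments are then concatenated without comparing values.
    """
    merged = heapq.merge(*segments, key=itemgetter(0))
    return [
        (key, [value for _, values in runs for value in values])
        for key, runs in groupby(merged, key=itemgetter(0))
    ]


# Two hypothetical sorted segments, already in grouped form.
segment_a = [("a", [1, 2]), ("c", [5])]
segment_b = [("a", [3]), ("b", [4])]

result = merge_grouped(segment_a, segment_b)
print(result)
```

Because heapq.merge is stable, values for a key shared across segments keep 
the order of the segments they came from.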

This bug tracks building a prototype IFile format that is optimized for 
smaller uncompressed sizes in memory buffers and for less compute-intensive 
merge sorts during the reducer phase.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
