[ 
https://issues.apache.org/jira/browse/TEZ-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048089#comment-14048089
 ] 

Gopal V commented on TEZ-1228:
------------------------------

I think dividing the raw/compressed streams has its own advantages.

{code}
+      //TODO: write to raw stream instead of compressed stream.
+      out.write(HEADER);
{code}

That header is just 3 bytes; it should have a '\0' at the end of it.
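
As a rough sketch of what the TODO is asking for, the header could be written to the raw stream first, with only the payload going through the compressed stream. Everything here is an assumption for illustration: the class name, the magic bytes, and the use of java.util.zip.DeflaterOutputStream as a stand-in for whatever codec the patch actually wraps.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DeflaterOutputStream;

class RawHeaderSketch {
  // Hypothetical magic: three ASCII letters plus a trailing '\0' (4 bytes total).
  static final byte[] HEADER = { 'T', 'I', 'F', 0 };

  // Writes HEADER uncompressed, then the payload through a compressed stream.
  static byte[] write(byte[] payload) {
    ByteArrayOutputStream rawOut = new ByteArrayOutputStream();
    try {
      // The header goes to the raw stream, so a reader can check it
      // without setting up a decompressor.
      rawOut.write(HEADER);
      // Only the payload passes through the compressed stream.
      try (OutputStream out = new DeflaterOutputStream(rawOut)) {
        out.write(payload);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return rawOut.toByteArray();
  }

  // Returns true if the buffer starts with the uncompressed HEADER bytes.
  static boolean hasHeader(byte[] file) {
    return file.length >= HEADER.length
        && Arrays.equals(Arrays.copyOf(file, HEADER.length), HEADER);
  }

  public static void main(String[] args) {
    byte[] file = write("key1value1".getBytes(StandardCharsets.UTF_8));
    System.out.println("header present: " + hasHeader(file));
  }
}
```

Keeping the header outside the compressed region means the same check works whether or not the body that follows is compressed.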

With that header as an option, we'll eventually be able to spill to compressed 
memory instead of spilling to disk to free up space in the sort buffer, using 
the memory buffers as IFiles (similar to how the shuffleToMemory works). 

We already do the equivalent of what MAPREDUCE-5947 is proposing in Tez's 
pipelined sorter, but with a format-defined in-memory IFile we can stretch the 
in-memory capacity by ~2x or more.

> Prototype IFile : Define a memory & merge optimized vertex-intermediate file 
> format for Tez
> -------------------------------------------------------------------------------------------
>
>                 Key: TEZ-1228
>                 URL: https://issues.apache.org/jira/browse/TEZ-1228
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>              Labels: perfomance
>         Attachments: TEZ-1228-IFile.pdf, TEZ-1228.WIP.1.patch
>
>
> The current vertex-intermediate format used all across Tez is a flat file of 
> variable length k,v pairs. For a significant number of use-cases, in 
> particular the sorted output phase, a large number of consecutive identical 
> keys are found within the same stream. The IFile format ends up writing each 
> key out fully into the stream to generate (K,V) pairs instead of ordering it 
> into a more efficient K, {V1, .. Vn} list.
> This duplication of key data needs larger buffers to hold in memory and 
> requires comparison between keys known to be identical while doing a merge 
> sort.
> This bug tracks the building of a prototype IFile format that is optimized 
> for smaller uncompressed sizes within memory buffers and for less 
> compute-intensive merge sorts during the reducer phase.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
