[jira] [Commented] (TEZ-1228) Prototype IFile : Define a memory & merge optimized vertex-intermediate file format for Tez

Siddharth Seth (JIRA) Mon, 07 Jul 2014 20:43:18 -0700

    [ https://issues.apache.org/jira/browse/TEZ-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054483#comment-14054483 ]


Siddharth Seth commented on TEZ-1228:
-------------------------------------

bq. 1) Indexed file format is more work than currently estimated and has very 
little value unless the index is going to be used eventually.
It's definitely a lot more work. I'm just wondering how much of a benefit it 
could provide eventually, once it is used by Pig/Hive. I'd guess skipping large 
chunks would be fairly useful.

bq. Fixed key length is a scenario that is less popular than I imagined - NULL 
is almost always a 1-byte value, whether it is for Int/Float/Decimal in Hive.
The sorter itself could use the fixed length nature to be more efficient on 
memory.

In any case, these would be separate JIRAs - nothing to be done here.
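
To illustrate the fixed-key-length point, here is a minimal Python sketch, not Tez's sorter. It assumes 4-byte big-endian unsigned integer keys; KEY_LEN, pack_keys, key_at and sorted_order are hypothetical names used only for this example. With a fixed width, key i is found at offset i * KEY_LEN, so the buffer needs no per-key length or offset entries.

```python
import struct

KEY_LEN = 4  # assumed fixed key width in bytes (big-endian unsigned int)


def pack_keys(keys):
    """Store keys back to back; no per-key length or offset is kept."""
    return b"".join(struct.pack(">I", k) for k in keys)


def key_at(buf, i):
    """Locate key i by arithmetic alone: it starts at i * KEY_LEN."""
    return buf[i * KEY_LEN:(i + 1) * KEY_LEN]


def sorted_order(buf):
    """Return record indices ordered by key bytes.

    Big-endian unsigned encoding makes byte order match numeric order.
    """
    n = len(buf) // KEY_LEN
    return sorted(range(n), key=lambda i: key_at(buf, i))


buf = pack_keys([30, 10, 20])
order = sorted_order(buf)
```

Here the three keys occupy exactly 3 * KEY_LEN bytes, and the sort works on record indices alone. A variable-length layout would additionally need a length prefix or an offset table per key.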

> Prototype IFile : Define a memory & merge optimized vertex-intermediate file 
> format for Tez
> -------------------------------------------------------------------------------------------
>
>                 Key: TEZ-1228
>                 URL: https://issues.apache.org/jira/browse/TEZ-1228
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>              Labels: perfomance
>         Attachments: TEZ-1228-IFile.pdf, TEZ-1228.1.patch, TEZ-1228.2.patch, 
> TEZ-1228.WIP.1.patch, TEZ-1228.WIP.2.patch
>
>
> The current vertex-intermediate format used all across Tez is a flat file of 
> variable-length k,v pairs. For a significant number of use-cases, in 
> particular the sorted output phase, a large number of consecutive identical 
> keys are found within the same stream. The IFile format ends up writing each 
> key out fully into the stream to generate (K,V) pairs instead of organizing it 
> into a more efficient K, {V1, .. Vn} list.
> This duplication of key data needs larger buffers to hold in memory and 
> requires comparisons between keys known to be identical while doing a merge 
> sort.
> This bug tracks the building of a prototype IFile format that is optimized 
> for smaller uncompressed sizes within memory buffers and is less 
> compute-intensive when performing merge sorts during the reducer phase.
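
The K, {V1, .. Vn} idea described above can be sketched roughly as follows. This is an illustrative Python sketch, not the format in the attached patches: the function names are hypothetical, and the size estimates count only raw key and value bytes, ignoring length prefixes, checksums and compression.

```python
from itertools import groupby
from operator import itemgetter


def group_sorted_pairs(pairs):
    """Collapse a key-sorted stream of (key, value) pairs into (key, [values]) runs."""
    return [(key, [v for _, v in run])
            for key, run in groupby(pairs, key=itemgetter(0))]


def flat_size(pairs):
    """Raw bytes when every pair repeats its key (framing ignored)."""
    return sum(len(k) + len(v) for k, v in pairs)


def grouped_size(groups):
    """Raw bytes when each key is stored once per run (framing ignored)."""
    return sum(len(k) + sum(len(v) for v in vs) for k, vs in groups)


# A sorted stream in which the key b"apple" repeats three times.
pairs = [(b"apple", b"1"), (b"apple", b"2"), (b"apple", b"3"), (b"pear", b"4")]
groups = group_sorted_pairs(pairs)
```

On this toy input the flat layout stores the 5-byte key three times, while the grouped layout stores it once. A merge over grouped runs would also only need to compare one key per run rather than one per value.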



--
This message was sent by Atlassian JIRA
(v6.2#6252)
