[ 
https://issues.apache.org/jira/browse/TEZ-945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13947208#comment-13947208
 ] 

Gopal V commented on TEZ-945:
-----------------------------

This approach might result in better compression for systems that use proper 
Writable types. That would be possible with a custom MapReduce app and even 
some parts of Pig.

For something like Hive, which uses a SerDe to transform rows into bytes, this 
approach of overloading the data output mechanisms won't work: all rows will 
be byte arrays of varying sizes, not tuples of ints or floats.

This clearly proves that there is value in splitting up data into multiple 
streams when it comes to ETL efficiency.

But to include Hive in this stream splitting, we need a more data-agnostic 
approach that supports SerDe-based key/value collections.

The only assumption we can make is that keys are repeated more often than 
values and that the keys will be sorted.
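
As a rough illustration of that assumption, here is a minimal sketch of a 
data-agnostic writer that treats keys and values as opaque byte arrays, sends 
them to two separate streams, and stores each distinct key once with a repeat 
count. The class name SplitStreamSketch and the on-disk layout are hypothetical; 
this is not the IFile format, the Tez API, or the approach in design.pdf.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

/** Hypothetical sketch: splits sorted key/value byte records into two streams. */
final class SplitStreamSketch {
    // Key stream holds (keyLength, keyBytes, runLength) per distinct key run.
    final ByteArrayOutputStream keyStream = new ByteArrayOutputStream();
    // Value stream holds (valueLength, valueBytes) per record.
    final ByteArrayOutputStream valueStream = new ByteArrayOutputStream();

    private byte[] currentKey; // key of the run being accumulated, or null
    private int runLength;     // consecutive records seen with currentKey

    /** Appends one record. Assumes callers supply records in key-sorted order. */
    void append(byte[] key, byte[] value) {
        if (currentKey == null || !Arrays.equals(currentKey, key)) {
            flushRun();                // the key changed: emit the previous run
            currentKey = key.clone();
        }
        runLength++;
        writeBytes(valueStream, value);
    }

    /** Emits the final pending key run. */
    void close() {
        flushRun();
    }

    private void flushRun() {
        if (currentKey == null) {
            return;
        }
        writeBytes(keyStream, currentKey);
        writeInt(keyStream, runLength);
        currentKey = null;
        runLength = 0;
    }

    // Length-prefixed byte array, so no knowledge of the row's types is needed.
    private static void writeBytes(ByteArrayOutputStream out, byte[] b) {
        writeInt(out, b.length);
        out.write(b, 0, b.length);
    }

    // Big-endian 32-bit integer.
    private static void writeInt(ByteArrayOutputStream out, int v) {
        out.write(v >>> 24);
        out.write(v >>> 16);
        out.write(v >>> 8);
        out.write(v);
    }
}
```

Because the sketch never inspects the bytes, it would work the same for 
SerDe output as for Writable types; each stream could then be handed to its 
own compressor.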

> ColumnStore-like intermediate file format for shuffle
> -----------------------------------------------------
>
>                 Key: TEZ-945
>                 URL: https://issues.apache.org/jira/browse/TEZ-945
>             Project: Apache Tez
>          Issue Type: New Feature
>            Reporter: Tsuyoshi OZAWA
>         Attachments: design.pdf
>
>
> In ETL workloads, intermediate data can be large. It is generally known that 
> the shuffle between the map phase and the reduce phase is the main bottleneck. 
> By improving IFile, the file format used for shuffle in Hadoop MapReduce and 
> Tez, we can improve shuffle performance. One idea is to introduce the column 
> store idea into IFile. It can improve the compression ratio or reduce the 
> overhead of IFile. As a result, the performance of ETL jobs can improve.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
