[jira] [Updated] (TAJO-374) Investigate more efficient intermediate shuffle methods

Hyunsik Choi (JIRA) Thu, 31 Jul 2014 02:54:32 -0700

     [ https://issues.apache.org/jira/browse/TAJO-374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Hyunsik Choi updated TAJO-374:
------------------------------

    Description: 
h3. Motivation

Currently, Tajo materializes intermediate data on local disks, storing one
file for each partition. This becomes inefficient and does not scale as data
volume increases. MapReduce (MR) addresses this challenge by sorting
intermediate key-values, grouping data with the same key, and indexing on
keys. However, that approach requires an unnecessary sort and additional disk
I/O, so it is not feasible in Tajo.

h3. References
 * TAJO-292 is an ad hoc fix that reduces the number of intermediate files,
but it still does not scale.
 * Optimizing MapReduce Job Performance 
(http://www.slideshare.net/cloudera/mr-perf)
 * Multilevel aggregation for Hadoop/MapReduce 
(http://www.slideshare.net/ozax86/prestrata-hadoop-word-meetup)
 * SAILFISH: A FRAMEWORK FOR LARGE SCALE DATA PROCESSING 
(http://research.yahoo.com/files/yl-2012-002.pdf)
 * MAPREDUCE-4502 - Node-level aggregation with combining the result of maps
 * MAPREDUCE-2841 - Task level native optimization

  was:
h3. Motivation

Currently, Tajo materializes intermediate data on local disks, storing one
file for each partition. This becomes inefficient and does not scale as data
volume increases. MapReduce (MR) addresses this challenge by sorting
intermediate key-values, grouping data with the same key, and indexing on
keys. However, that approach requires an unnecessary sort and additional disk
I/O, so it is not feasible in Tajo.

h3. References
 * TAJO-292 is an ad hoc fix that reduces the number of intermediate files,
but it still does not scale.
 * Optimizing MapReduce Job Performance 
(http://www.slideshare.net/cloudera/mr-perf)
 * Multilevel aggregation for Hadoop/MapReduce 
(http://www.slideshare.net/ozax86/prestrata-hadoop-word-meetup)
 * SAILFISH: A FRAMEWORK FOR LARGE SCALE DATA PROCESSING 
(http://research.yahoo.com/files/yl-2012-002.pdf)
 * MAPREDUCE-4502 - Node-level aggregation with combining the result of maps


> Investigate more efficient intermediate shuffle methods
> -------------------------------------------------------
>
>                 Key: TAJO-374
>                 URL: https://issues.apache.org/jira/browse/TAJO-374
>             Project: Tajo
>          Issue Type: Improvement
>          Components: data shuffle
>            Reporter: Hyunsik Choi
>
> h3. Motivation
> Currently, Tajo materializes intermediate data on local disks, storing one
> file for each partition. This becomes inefficient and does not scale as data
> volume increases. MapReduce (MR) addresses this challenge by sorting
> intermediate key-values, grouping data with the same key, and indexing on
> keys. However, that approach requires an unnecessary sort and additional
> disk I/O, so it is not feasible in Tajo.
> h3. References
>  * TAJO-292 is an ad hoc fix that reduces the number of intermediate files,
> but it still does not scale.
>  * Optimizing MapReduce Job Performance 
> (http://www.slideshare.net/cloudera/mr-perf)
>  * Multilevel aggregation for Hadoop/MapReduce 
> (http://www.slideshare.net/ozax86/prestrata-hadoop-word-meetup)
>  * SAILFISH: A FRAMEWORK FOR LARGE SCALE DATA PROCESSING 
> (http://research.yahoo.com/files/yl-2012-002.pdf)
>  * MAPREDUCE-4502 - Node-level aggregation with combining the result of maps
>  * MAPREDUCE-2841 - Task level native optimization
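
The MapReduce-style approach mentioned in the Motivation (sort intermediate key-values, group data with the same key, index on keys) can be illustrated with a minimal sketch. This is not Tajo or Hadoop code. The names `hash_partition`, `write_indexed_shuffle`, and `read_partition`, the tab-separated record layout, and the in-memory stream are all assumptions made for illustration. The sketch writes every partition into one stream and keeps a per-partition (offset, length) index, in contrast to one file per partition.

```python
import io
from collections import defaultdict


def hash_partition(key, num_partitions):
    # Hypothetical partitioner: a stable hash over the key's UTF-8 bytes.
    return sum(key.encode("utf-8")) % num_partitions


def write_indexed_shuffle(records, num_partitions, out):
    """Write all records to one binary stream, grouped by partition.

    Returns an index {partition_id: (offset, length)} so that a reader
    can fetch a single partition without scanning the whole stream.
    """
    # Group records by partition id.
    groups = defaultdict(list)
    for key, value in records:
        groups[hash_partition(key, num_partitions)].append((key, value))

    index = {}
    for pid in sorted(groups):
        start = out.tell()
        # Sort within the partition so equal keys end up adjacent.
        for key, value in sorted(groups[pid]):
            out.write(f"{key}\t{value}\n".encode("utf-8"))
        index[pid] = (start, out.tell() - start)
    return index


def read_partition(stream, index, pid):
    """Read back one partition using the (offset, length) index."""
    if pid not in index:
        return []
    offset, length = index[pid]
    stream.seek(offset)
    data = stream.read(length).decode("utf-8")
    return [tuple(line.split("\t")) for line in data.splitlines()]


if __name__ == "__main__":
    buf = io.BytesIO()  # stands in for a single intermediate file
    recs = [("b", "2"), ("a", "1"), ("c", "3"), ("a", "4")]
    idx = write_indexed_shuffle(recs, 2, buf)
    print(idx)
    print(read_partition(buf, idx, 1))
```

The sort inside each partition is the cost the issue calls unnecessary for Tajo. A variant that only groups by partition id and skips the per-key sort would keep the single-file-plus-index layout without that step.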



--
This message was sent by Atlassian JIRA
(v6.2#6252)
