Hyunsik Choi created TAJO-374:
---------------------------------
Summary: Investigate more efficient intermediate data handling
Key: TAJO-374
URL: https://issues.apache.org/jira/browse/TAJO-374
Project: Tajo
Issue Type: Improvement
Components: repartitioning
Reporter: Hyunsik Choi
h3. Motivation
Currently, Tajo materializes intermediate data on local disks, storing one file
for each partition. This becomes inefficient and does not scale as the data
volume and the number of partitions increase. MapReduce (MR) addressed this
problem by sorting intermediate key-values, grouping records with the same key,
and indexing on keys. However, that approach requires sorting and extra disk I/O
that Tajo does not need, so it is not feasible in Tajo.
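To make the trade-off concrete, here is a minimal Python sketch (not Tajo code) that contrasts the two layouts: one file per partition versus a single file per task with an in-memory offset index. All function names, the file names, and the tab-separated record format are hypothetical and chosen only for illustration. Records are grouped by hash partition without any sort, and buffering everything in memory is a simplification.

```python
import os
import tempfile
from collections import defaultdict


def hash_partition(rows, num_partitions):
    """Group (key, value) rows by hash partition. No sorting is involved."""
    buckets = defaultdict(list)
    for key, value in rows:
        record = f"{key}\t{value}\n".encode()
        buckets[hash(key) % num_partitions].append(record)
    return buckets


def write_file_per_partition(rows, num_partitions, out_dir):
    """Layout A: one file per non-empty partition (file count grows with partitions)."""
    paths = []
    for pid, records in sorted(hash_partition(rows, num_partitions).items()):
        path = os.path.join(out_dir, f"part-{pid}.data")
        with open(path, "wb") as f:
            f.writelines(records)
        paths.append(path)
    return paths


def write_single_indexed_file(rows, num_partitions, out_dir):
    """Layout B: one file per task plus an index of (offset, length) per partition."""
    path = os.path.join(out_dir, "task.data")
    index = {}
    with open(path, "wb") as f:
        for pid, records in sorted(hash_partition(rows, num_partitions).items()):
            start = f.tell()
            f.writelines(records)
            index[pid] = (start, f.tell() - start)
    return path, index


def read_partition(path, index, pid):
    """Fetch one partition's bytes with a single seek, using the index."""
    offset, length = index[pid]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)
```

With layout A, the number of intermediate files is roughly the number of tasks multiplied by the number of partitions. With layout B it is one data file per task, at the cost of keeping or persisting the index.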
h3. References
* TAJO-292 is an ad hoc fix that reduces the number of intermediate files,
but it is still not scalable.
* Optimizing MapReduce Job Performance
(http://www.slideshare.net/cloudera/mr-perf)
* Multilevel aggregation for Hadoop/MapReduce
(http://www.slideshare.net/ozax86/prestrata-hadoop-word-meetup)
* Sailfish: A Framework for Large Scale Data Processing
(http://research.yahoo.com/files/yl-2012-002.pdf)
* MAPREDUCE-4502 - Node-level aggregation with combining the result of maps
--
This message was sent by Atlassian JIRA
(v6.1#6144)