Rajesh Balamohan created TEZ-2198:
-------------------------------------

             Summary: Fix sorter spill counts
                 Key: TEZ-2198
                 URL: https://issues.apache.org/jira/browse/TEZ-2198
             Project: Apache Tez
          Issue Type: Improvement
            Reporter: Rajesh Balamohan


Prior to pipelined shuffle, Tez merged all spilled data into a single file.
This produced one index file and one output file. In this context,
TaskCounter.ADDITIONAL_SPILL_COUNT referred to the number of additional
spills, and no counter was needed to track the number of merges.

With pipelined shuffle, there is no final merge, so ADDITIONAL_SPILL_COUNT
is misleading: these spills are direct output files that are read directly
by the consumers.

It would be good to have the following counters:
- ADDITIONAL_SPILL_COUNT: the number of spills the task needs in order to
generate the final merged output
- TOTAL_SPILLS: the total number of shuffle directories (index +
output files) created by the end of processing.

For example, assume the sorter generated 5 spills in an attempt.
Without pipelining:
===================
ADDITIONAL_SPILL_COUNT = 5 <-- Additional spills involved in sorting
TOTAL_SPILLS = 1 <-- Final merged output

With pipelining:
================
ADDITIONAL_SPILL_COUNT = 5 <-- Additional spills involved in sorting
TOTAL_SPILLS = 0 <-- No final output
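
The two cases above can be written down as a small Java sketch. It only
reproduces the numbers exactly as given in this example; it is not Tez code.
The class SpillCounterSketch, the nested Counter enum and the countersFor
method are hypothetical names introduced for illustration, and the rule
"1 final output without pipelining, 0 with pipelining" is taken from the
example values rather than from any actual sorter implementation.

```java
import java.util.EnumMap;
import java.util.Map;

public class SpillCounterSketch {

    // Hypothetical stand-ins for the two counters discussed in this issue.
    public enum Counter { ADDITIONAL_SPILL_COUNT, TOTAL_SPILLS }

    /**
     * Returns the counter values for one task attempt, following the
     * worked example in this message (an assumption, not Tez behaviour).
     *
     * @param spills    number of spills the sorter generated
     * @param pipelined whether pipelined shuffle is in use
     */
    public static Map<Counter, Long> countersFor(int spills, boolean pipelined) {
        if (spills < 0) {
            throw new IllegalArgumentException("spills must be >= 0");
        }
        Map<Counter, Long> counters = new EnumMap<>(Counter.class);
        // Both cases in the example report every spill here.
        counters.put(Counter.ADDITIONAL_SPILL_COUNT, (long) spills);
        // Example values: one final merged output without pipelining,
        // no final output with pipelining.
        counters.put(Counter.TOTAL_SPILLS, pipelined ? 0L : 1L);
        return counters;
    }

    public static void main(String[] args) {
        System.out.println("Without pipelining: " + countersFor(5, false));
        System.out.println("With pipelining:    " + countersFor(5, true));
    }
}
```

Keeping both counters in a single map makes it easy to compare the two modes
side by side for the same number of spills.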

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
