Rajesh Balamohan created TEZ-2198:
-------------------------------------
Summary: Fix sorter spill counts
Key: TEZ-2198
URL: https://issues.apache.org/jira/browse/TEZ-2198
Project: Apache Tez
Issue Type: Improvement
Reporter: Rajesh Balamohan
Prior to pipelined shuffle, tez merged all spilled data into a single file.
This ended up creating one index file and one output file. In this context,
TaskCounter.ADDITIONAL_SPILL_COUNT was referred as the number of additional
spills and there was no counter needed to track the number of merges.
With pipelined shuffle, there is no final merge and ADDITIONAL_SPILL_COUNT
would be misleading, as these spills are direct output files which are consumed
by the consumers.
It would be good to have the following
- ADDITIONAL_SPILL_COUNT: represents the spills that are needed by the task to
generate the final merged output
- TOTAL_SPILLS: represents the total number of shuffle directories (index +
output files) that got created at the end of processing.
For e.g, Assume sorter generated 5 spills in an attempt
Without pipelining:
==============
ADDITIONAL_SPILL_COUNT = 5 <-- Additional spills involved in sorting
TOTAL_SPILLS = 1 <-- Final merged output
With pipelining:
============
ADDITIONAL_SPILL_COUNT = 5 <-- Additional spills involved in sorting
TOTAL_SPILLS = 0 <--- No final output
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)