[ https://issues.apache.org/jira/browse/TEZ-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14360173#comment-14360173 ]

Gopal V commented on TEZ-2198:
------------------------------

That exact update means a clear recommendation on whether to use this optimization can be made simply by checking ADDITIONAL_SPILL_COUNT: once pipelined shuffle is active, ADDITIONAL_SPILL_COUNT will always be zero. That makes it easy to check whether pipelined shuffle is active and to predict whether it adds any benefit for a given case.

> Fix sorter spill counts
> -----------------------
>
>                 Key: TEZ-2198
>                 URL: https://issues.apache.org/jira/browse/TEZ-2198
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>
> Prior to pipelined shuffle, Tez merged all spilled data into a single file.
> This ended up creating one index file and one output file. In this context,
> TaskCounter.ADDITIONAL_SPILL_COUNT referred to the number of additional
> spills, and no counter was needed to track the number of merges.
> With pipelined shuffle, there is no final merge and ADDITIONAL_SPILL_COUNT
> would be misleading, as these spills are direct output files which are
> consumed by the consumers.
> It would be good to have the following:
> - ADDITIONAL_SPILL_COUNT: represents the spills that are needed by the task
> to generate the final merged output
> - TOTAL_SPILLS: represents the total number of shuffle directories (index +
> output files) that got created at the end of processing.
> For example, assume the sorter generated 5 spills in an attempt.
> Without pipelining:
> ==============
> ADDITIONAL_SPILL_COUNT = 5 <-- Additional spills involved in sorting
> TOTAL_SPILLS = 1 <-- Final merged output
> With pipelining:
> ============
> ADDITIONAL_SPILL_COUNT = 5 <-- Additional spills involved in sorting
> TOTAL_SPILLS = 0 <-- No final output

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
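
The check described in the comment above could be sketched roughly as follows. This is only an illustration in Python, not the Tez API: the counters are assumed to be available as a plain dict of counter name to value, and `spill_hint` and its return labels are hypothetical names chosen for this example.

```python
def spill_hint(counters):
    """Classify a task attempt from its ADDITIONAL_SPILL_COUNT value.

    `counters` is assumed to be a plain dict of counter name -> value
    (a hypothetical representation, not how Tez exposes counters).
    """
    additional = counters["ADDITIONAL_SPILL_COUNT"]
    if additional > 0:
        # Spills had to be merged into a final output, so the final
        # merge is a cost that pipelined shuffle might avoid.
        return "merged-spills"
    # Zero is what the comment says an attempt reports once
    # pipelined shuffle is active.
    return "no-merged-spills"


print(spill_hint({"ADDITIONAL_SPILL_COUNT": 5}))
print(spill_hint({"ADDITIONAL_SPILL_COUNT": 0}))
```

A zero value on its own presumably cannot distinguish a pipelined attempt from one that never needed an extra spill, so a real check would likely also consult the job configuration.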