[ https://issues.apache.org/jira/browse/TEZ-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16324679#comment-16324679 ]
Jonathan Eagles commented on TEZ-3877:
--------------------------------------

[~jlowe], I'm still looking at this in detail and have a couple of investigations left. Generally, I think this works for all the cases where spillInfoList is populated. However, there are some cases where spill files are not placed into that list. Try to verify that this handles all the cases. Specifically, only files that don't have an index file associated with them are placed in spillInfoList, as shown below.

{code}
private void handleSpillIndex(SpillPathDetails spillPathDetails, TezSpillRecord spillRecord)
    throws IOException {
  if (spillPathDetails.indexFilePath != null) {
    //write the index record
    spillRecord.writeToFile(spillPathDetails.indexFilePath, conf);
  } else {
    //add to cache
    SpillInfo spillInfo = new SpillInfo(spillRecord, spillPathDetails.outputFilePath);
    spillInfoList.add(spillInfo);
    numAdditionalSpillsCounter.increment(1);
  }
}
{code}

This makes me wonder which cases have index files. I wonder whether pipelined shuffle spills have index files, as shown below. On the other hand, writing large records looks correct at first glance.

{code}
private SpillPathDetails getSpillPathDetails(boolean isFinalSpill, long expectedSpillSize,
    int spillNumber) throws IOException {
  long spillSize = (expectedSpillSize < 0) ?
      (currentBuffer.nextPosition + numPartitions * APPROX_HEADER_LENGTH) : expectedSpillSize;

  Path outputFilePath = null;
  Path indexFilePath = null;

  if (!pipelinedShuffle && isFinalMergeEnabled) {
    if (isFinalSpill) {
      outputFilePath = outputFileHandler.getOutputFileForWrite(spillSize);
      indexFilePath = outputFileHandler.getOutputIndexFileForWrite(indexFileSizeEstimate);

      //Setting this for tests
      finalOutPath = outputFilePath;
      finalIndexPath = indexFilePath;
    } else {
      outputFilePath = outputFileHandler.getSpillFileForWrite(spillNumber, spillSize);
    }
  } else {
    outputFilePath = outputFileHandler.getSpillFileForWrite(spillNumber, spillSize);
    indexFilePath = outputFileHandler.getSpillIndexFileForWrite(spillNumber, indexFileSizeEstimate);
  }

  return new SpillPathDetails(outputFilePath, indexFilePath, spillNumber);
}
{code}

> Delete unordered spill files once merge is done
> -----------------------------------------------
>
>                 Key: TEZ-3877
>                 URL: https://issues.apache.org/jira/browse/TEZ-3877
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>            Assignee: Jason Lowe
>         Attachments: TEZ-3877.001.patch
>
>
> I see that spill files are not deleted right after merge completes. We
> should do that as it takes up a lot of space and we can't afford that wastage
> when Tez takes up a lot of shuffle space with complex DAGs. [~jlowe] told me
> they are only cleaned up after application completes as they are written in
> app directory and not container directory. That also has to be done so that
> they are cleaned up by node manager during task failures or container crashes.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
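
For reference, the branch in the quoted getSpillPathDetails can be condensed into a predicate over its three flags, which makes it easier to see which spills never reach spillInfoList. This is only a sketch: the class SpillIndexSketch and the methods hasIndexFile and isCachedInSpillInfoList are hypothetical names that do not exist in Tez, and the logic assumes the two snippets quoted in the comment are the only code paths that decide whether a spill gets an index file.

```java
public final class SpillIndexSketch {
    private SpillIndexSketch() {}

    /**
     * Hypothetical helper: mirrors the if/else structure of the quoted
     * getSpillPathDetails and reports whether indexFilePath would be non-null.
     */
    public static boolean hasIndexFile(boolean pipelinedShuffle,
                                       boolean isFinalMergeEnabled,
                                       boolean isFinalSpill) {
        if (!pipelinedShuffle && isFinalMergeEnabled) {
            // Only the final spill gets an index file in this branch.
            return isFinalSpill;
        }
        // Pipelined shuffle, or final merge disabled: every spill gets an index file.
        return true;
    }

    /**
     * Hypothetical helper: per the quoted handleSpillIndex, a spill is added
     * to spillInfoList only when it has no index file path.
     */
    public static boolean isCachedInSpillInfoList(boolean pipelinedShuffle,
                                                  boolean isFinalMergeEnabled,
                                                  boolean isFinalSpill) {
        return !hasIndexFile(pipelinedShuffle, isFinalMergeEnabled, isFinalSpill);
    }

    public static void main(String[] args) {
        boolean[] flags = {false, true};
        // Enumerate all eight flag combinations and print where each spill is tracked.
        for (boolean pipelined : flags) {
            for (boolean finalMerge : flags) {
                for (boolean finalSpill : flags) {
                    System.out.printf(
                        "pipelinedShuffle=%b isFinalMergeEnabled=%b isFinalSpill=%b -> inSpillInfoList=%b%n",
                        pipelined, finalMerge, finalSpill,
                        isCachedInSpillInfoList(pipelined, finalMerge, finalSpill));
                }
            }
        }
    }
}
```

Under that assumption, only non-final spills with pipelined shuffle off and final merge enabled end up in spillInfoList. Every other combination writes an index file to disk, so cleanup that only walks spillInfoList would not see those spill files.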