[ 
https://issues.apache.org/jira/browse/TEZ-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16324679#comment-16324679
 ] 

Jonathan Eagles commented on TEZ-3877:
--------------------------------------

[~jlowe], I'm still looking at this in detail and have a couple of 
investigations left. Generally, I think this works for all the cases where 
spillInfoList is populated. However, there are some cases where spill files are 
not placed into that list, so I'm trying to verify that the patch handles all of 
them. Specifically, only files that don't have an index file associated with 
them are placed in the spillInfoList, as shown below.

{code}
  private void handleSpillIndex(SpillPathDetails spillPathDetails,
      TezSpillRecord spillRecord) throws IOException {
    if (spillPathDetails.indexFilePath != null) {
      //write the index record
      spillRecord.writeToFile(spillPathDetails.indexFilePath, conf);
    } else {
      //add to cache
      SpillInfo spillInfo = new SpillInfo(spillRecord,
          spillPathDetails.outputFilePath);
      spillInfoList.add(spillInfo);
      numAdditionalSpillsCounter.increment(1);
    }
  }
{code}
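
To make the concern concrete, here is a minimal sketch of that branching. It is a simplified, hypothetical model and not Tez code: the class name SpillIndexModel, the use of plain strings in place of Path and TezSpillRecord, and the file names are all assumptions made for illustration. It only shows that a spill handed an index file path never lands in the list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the branch in handleSpillIndex.
// Strings stand in for Path/TezSpillRecord; this is not the Tez implementation.
public class SpillIndexModel {
  private final List<String> spillInfoList = new ArrayList<>();
  private final List<String> writtenIndexFiles = new ArrayList<>();

  public void handleSpillIndex(String outputFilePath, String indexFilePath) {
    if (indexFilePath != null) {
      // Index record is written to disk; the spill is NOT remembered in the list.
      writtenIndexFiles.add(indexFilePath);
    } else {
      // No index file: the spill is cached in the list.
      spillInfoList.add(outputFilePath);
    }
  }

  public List<String> trackedSpills() {
    return spillInfoList;
  }

  public List<String> indexFiles() {
    return writtenIndexFiles;
  }

  public static void main(String[] args) {
    SpillIndexModel model = new SpillIndexModel();
    model.handleSpillIndex("spill_0.out", null);
    model.handleSpillIndex("spill_1.out", "spill_1.out.index");
    System.out.println("tracked: " + model.trackedSpills());
    System.out.println("index files: " + model.indexFiles());
  }
}
```

In this model, any cleanup that walks only the tracked list would skip spill_1.out, which is the kind of case I'm trying to rule out.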

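This makes me wonder which cases have index files. I suspect pipelined spills 
have index files, as shown below: the else branch, taken when pipelinedShuffle 
is set or isFinalMergeEnabled is false, assigns an index file path to every 
spill, which would keep those spills out of spillInfoList. On the other hand, 
the large-record write path looks correct at first glance.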

{code}
  private SpillPathDetails getSpillPathDetails(boolean isFinalSpill,
      long expectedSpillSize, int spillNumber) throws IOException {
    long spillSize = (expectedSpillSize < 0) ?
        (currentBuffer.nextPosition + numPartitions * APPROX_HEADER_LENGTH) :
        expectedSpillSize;

    Path outputFilePath = null;
    Path indexFilePath = null;

    if (!pipelinedShuffle && isFinalMergeEnabled) {
      if (isFinalSpill) {
        outputFilePath = outputFileHandler.getOutputFileForWrite(spillSize);
        indexFilePath =
            outputFileHandler.getOutputIndexFileForWrite(indexFileSizeEstimate);

        //Setting this for tests
        finalOutPath = outputFilePath;
        finalIndexPath = indexFilePath;
      } else {
        outputFilePath = outputFileHandler.getSpillFileForWrite(spillNumber,
            spillSize);
      }
    } else {
      outputFilePath = outputFileHandler.getSpillFileForWrite(spillNumber,
          spillSize);
      indexFilePath = outputFileHandler.getSpillIndexFileForWrite(spillNumber,
          indexFileSizeEstimate);
    }

    return new SpillPathDetails(outputFilePath, indexFilePath, spillNumber);
  }
{code}
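
As a rough way to enumerate the cases, the sketch below mirrors only the branch structure of getSpillPathDetails quoted above and prints which flag combinations yield an index file path. The class SpillPathCases and its method names are hypothetical, and the "in spillInfoList" column assumes every spill is passed to handleSpillIndex with the details returned here, which the quoted code alone does not confirm.

```java
// Hypothetical enumeration of the branches in getSpillPathDetails.
// It models only whether indexFilePath ends up non-null; it is not Tez code.
public class SpillPathCases {

  public static boolean hasIndexFile(boolean pipelinedShuffle,
      boolean isFinalMergeEnabled, boolean isFinalSpill) {
    if (!pipelinedShuffle && isFinalMergeEnabled) {
      // Only the final spill gets an index file in this branch.
      return isFinalSpill;
    }
    // Pipelined shuffle or final merge disabled: every spill gets an index file.
    return true;
  }

  // Assumption: a spill is added to spillInfoList exactly when it has no index file.
  public static boolean addedToSpillInfoList(boolean pipelinedShuffle,
      boolean isFinalMergeEnabled, boolean isFinalSpill) {
    return !hasIndexFile(pipelinedShuffle, isFinalMergeEnabled, isFinalSpill);
  }

  public static void main(String[] args) {
    boolean[] flags = {false, true};
    System.out.println("pipelined finalMerge finalSpill -> indexFile inSpillInfoList");
    for (boolean pipelined : flags) {
      for (boolean finalMerge : flags) {
        for (boolean finalSpill : flags) {
          System.out.println(pipelined + " " + finalMerge + " " + finalSpill
              + " -> " + hasIndexFile(pipelined, finalMerge, finalSpill)
              + " " + addedToSpillInfoList(pipelined, finalMerge, finalSpill));
        }
      }
    }
  }
}
```

Under these assumptions, only non-final spills with pipelined shuffle off and final merge enabled would show up in spillInfoList; every other combination carries an index file.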



> Delete unordered spill files once merge is done
> -----------------------------------------------
>
>                 Key: TEZ-3877
>                 URL: https://issues.apache.org/jira/browse/TEZ-3877
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>            Assignee: Jason Lowe
>         Attachments: TEZ-3877.001.patch
>
>
>   I see that spill files are not deleted right after merge completes. We 
> should do that as it takes up a lot of space and we can't afford that wastage 
> when Tez takes up a lot of shuffle space with complex DAGs. [~jlowe] told me 
> they are only cleaned up after application completes as they are written in 
> app directory and not container directory. That also has to be done so that 
> they are cleaned up by node manager during task failures or container crashes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)