[
https://issues.apache.org/jira/browse/TEZ-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14283338#comment-14283338
]
Siddharth Seth commented on TEZ-1855:
-------------------------------------
In PipelineSorter
{code}
+ sameVolRename(filename, finalOutputFile);
+ sameVolRename(indexFilename, finalIndexFile);
{code}
finalOutputFile / finalIndexFile would need to be regenerated before the
rename - otherwise it's possible for them to have been generated under a
different local directory than the files being renamed.
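To illustrate the concern, here is a minimal sketch of deriving the final path
from the file that is about to be renamed, so both stay in the same local
directory. It uses java.nio instead of the Hadoop Path / FileSystem classes the
patch works with, and finalPathNextTo, the file names and this version of
sameVolRename are hypothetical stand-ins, not the code in the patch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SameVolumeRenameSketch {

    // Hypothetical helper: build the final path as a sibling of the spill file,
    // so both paths are guaranteed to share the same parent directory.
    static Path finalPathNextTo(Path spillFile, String finalName) {
        return spillFile.resolveSibling(finalName);
    }

    // Hypothetical stand-in for sameVolRename: an atomic move, which throws
    // instead of falling back to copy + delete when a plain rename is not possible.
    static void sameVolRename(Path from, Path to) throws IOException {
        Files.move(from, to, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        // A temp directory plays the role of one of the task's local directories.
        Path localDir = Files.createTempDirectory("local-dir-0");
        Path spill = Files.write(localDir.resolve("spill_0.out"), new byte[] {1, 2, 3});

        // Regenerate the final path from the spill location instead of reusing
        // a path that may have been allocated in another local directory.
        Path finalOutputFile = finalPathNextTo(spill, "file.out");
        sameVolRename(spill, finalOutputFile);

        System.out.println(Files.exists(finalOutputFile) && !Files.exists(spill));
    }
}
```

The same derivation would apply to the index file.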
{code}
+ Path outputPath = finalOutputFile;
fileOutputByteCounter.increment(rfs.getFileStatus(outputPath).getLen());
{code}
I'll file a follow-up to try to update the counters from previously recorded
information instead of accessing the disk.
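One possible shape for that follow-up, as a rough sketch only: count the bytes
as they are written, so the length is already known when the counter is
updated. CountingOutputStream here is a hypothetical helper written for this
example, and the ByteArrayOutputStream stands in for the real output file.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class WrittenBytesSketch {

    // Hypothetical wrapper that remembers how many bytes passed through it.
    static final class CountingOutputStream extends FilterOutputStream {
        private long count;

        CountingOutputStream(OutputStream out) {
            super(out);
        }

        @Override
        public void write(int b) throws IOException {
            out.write(b);
            count++;
        }

        @Override
        public void write(byte[] b, int off, int len) throws IOException {
            out.write(b, off, len);
            count += len;
        }

        long getCount() {
            return count;
        }
    }

    public static void main(String[] args) throws IOException {
        try (CountingOutputStream cos = new CountingOutputStream(new ByteArrayOutputStream())) {
            cos.write(new byte[] {1, 2, 3, 4});
            cos.write(5);
            // The counter could be incremented with this value rather than
            // with the length returned by a getFileStatus call.
            System.out.println(cos.getCount()); // prints 5
        }
    }
}
```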
getOutputFile / getOutputIndexFile should also be removed from
TezTaskOutputFiles - they're currently used in tests and to figure out empty
partition information. That information will likely have to be exposed by
Writers / Inputs (with test visibility). getInputFile should go as well - it
isn't used anywhere.
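As a rough illustration of what exposing this from a writer could look like,
the sketch below has the writer remember its own output path and per-partition
sizes while writing. TrackedWriterSketch and all of its methods are
hypothetical names for this example, not existing Tez classes or APIs.

```java
import java.util.BitSet;

public class TrackedWriterSketch {
    private final String outputFile;       // path chosen when the writer was created
    private final long[] partitionLengths; // bytes written per partition so far

    TrackedWriterSketch(String outputFile, int numPartitions) {
        this.outputFile = outputFile;
        this.partitionLengths = new long[numPartitions];
    }

    // Called by the (hypothetical) write path whenever data lands in a partition.
    void recordPartition(int partition, long bytesWritten) {
        partitionLengths[partition] += bytesWritten;
    }

    // Package-private accessor, i.e. intended for tests only.
    String getOutputFile() {
        return outputFile;
    }

    // Empty partition information derived from tracked state, not from disk.
    BitSet getEmptyPartitions() {
        BitSet empty = new BitSet(partitionLengths.length);
        for (int i = 0; i < partitionLengths.length; i++) {
            if (partitionLengths[i] == 0) {
                empty.set(i);
            }
        }
        return empty;
    }

    public static void main(String[] args) {
        TrackedWriterSketch writer = new TrackedWriterSketch("/tmp/local-dir-0/file.out", 3);
        writer.recordPartition(1, 10);
        System.out.println(writer.getEmptyPartitions()); // prints {0, 2}
    }
}
```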
> Avoid scanning for previously written files within Inputs / Outputs
> -------------------------------------------------------------------
>
> Key: TEZ-1855
> URL: https://issues.apache.org/jira/browse/TEZ-1855
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Siddharth Seth
> Assignee: Rajesh Balamohan
> Attachments: TEZ-1855.1.patch
>
>
> TezTaskOutput has a bunch of methods - getOutputFile, getOutputIndexFile,
> getSpillIndexFile - which are used within an Output to scan for files
> written earlier by the same Output. This should be avoided in favour of
> keeping track of previously written files.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)