[ 
https://issues.apache.org/jira/browse/TEZ-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14283338#comment-14283338
 ] 

Siddharth Seth commented on TEZ-1855:
-------------------------------------

In PipelineSorter
{code}
+      sameVolRename(filename, finalOutputFile);
+      sameVolRename(indexFilename, finalIndexFile);
{code}
finalOutputFile / finalIndexFile would need to be regenerated here. Otherwise
it's possible for them to have been generated under a different local
directory than the spill files, and the rename would not necessarily stay on
the same volume.
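
A rough sketch of what I mean by regenerating the final paths, using plain java.nio rather than the Tez / Hadoop classes. SameVolumeFinalize, finalPathNextTo and finalizeSpill are hypothetical names for illustration only; the idea is just that the final path is derived from the spill file's own directory instead of being allocated separately.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical illustration only; not the PipelineSorter code.
final class SameVolumeFinalize {

    // Derive the final path from the spill file's own parent directory,
    // so source and target always share a local directory.
    static Path finalPathNextTo(Path spillFile, String finalName) {
        return spillFile.resolveSibling(finalName);
    }

    // Rename the spill file to the regenerated final path and return that path.
    static Path finalizeSpill(Path spillFile, String finalName) throws IOException {
        Path target = finalPathNextTo(spillFile, finalName);
        return Files.move(spillFile, target, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

The same idea would apply to the index file: derive it from the index spill's directory before the rename.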

{code}
+    Path outputPath = finalOutputFile;
     fileOutputByteCounter.increment(rfs.getFileStatus(outputPath).getLen());
{code}
I'll file a follow-up to try to update the counters from previously recorded
information instead of accessing the disk.
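
One possible shape for that follow-up, as a sketch only: count bytes as they are written so the length is already known when the counter is updated. CountingOutputStream here is a hypothetical wrapper written for this example, not an existing Tez class.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical wrapper: remembers how many bytes passed through it.
final class CountingOutputStream extends FilterOutputStream {
    private long bytesWritten;

    CountingOutputStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        bytesWritten++;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        // Delegate the whole range at once and count it once.
        out.write(b, off, len);
        bytesWritten += len;
    }

    long getBytesWritten() {
        return bytesWritten;
    }
}
```

The recorded value could then be passed to fileOutputByteCounter.increment(...) in place of rfs.getFileStatus(outputPath).getLen(). This assumes nothing below the counting layer changes the on-disk size (checksums, for example), which would need checking.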


getOutputFile / getOutputIndexFile should also be removed from
TezTaskOutputFiles. They're currently used in tests and to figure out empty
partition information, so this will likely have to be exposed by Writers /
Inputs (with test visibility). getInputFile should go as well, but that isn't
used anywhere.
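
As a sketch of the "keep track instead of looking up" approach: the writer records each path at the point it is written and exposes the list to callers and tests. SpillTracker and its methods are hypothetical names, not part of the patch.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration: remember written files instead of scanning for them.
final class SpillTracker {
    private final List<Path> spillFiles = new ArrayList<>();

    // Called by the writer right after a spill file has been written.
    void recordSpill(Path spillFile) {
        spillFiles.add(spillFile);
    }

    // Read-only view, in write order; could be given test visibility.
    List<Path> getSpillFiles() {
        return Collections.unmodifiableList(spillFiles);
    }
}
```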


> Avoid scanning for previously written files within Inputs / Outputs
> -------------------------------------------------------------------
>
>                 Key: TEZ-1855
>                 URL: https://issues.apache.org/jira/browse/TEZ-1855
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Siddharth Seth
>            Assignee: Rajesh Balamohan
>         Attachments: TEZ-1855.1.patch
>
>
> TezTaskOutput has a bunch of methods - getOutputFile, getOutputIndexFile, 
> getSpillIndexFile - which are used within an Output to scan for files 
> written earlier by the same Output. This should be avoided in favour of 
> keeping track of previously written files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
