[jira] [Commented] (CRUNCH-256) SequentialFileNamingScheme should cache the # of files in the target directory after the first read

Gabriel Reid (JIRA) Thu, 22 Aug 2013 23:42:28 -0700

    [ 
https://issues.apache.org/jira/browse/CRUNCH-256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13748343#comment-13748343
 ]


Gabriel Reid commented on CRUNCH-256:
-------------------------------------

The one potential issue that I can see with this is a situation where two 
targets point to the same place, you're using WriteMode.APPEND, and you call 
run() multiple times on the pipeline.

For example:

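    Target targetA = To.textFile("output");
    Target targetB = To.textFile("output");

    // First run: both targets append to the same directory.
    pipeline.write(collectionA, targetA, WriteMode.APPEND);
    pipeline.write(collectionB, targetB, WriteMode.APPEND);
    pipeline.run();

    // Second run: targetA is reused after targetB has added files.
    pipeline.write(collectionC, targetA, WriteMode.APPEND);
    pipeline.run();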

I think in that case the contents of collectionB would probably be overwritten 
with this change, since the file count cached for targetA would not include 
the files written through targetB.
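
To make the counter arithmetic concrete, here is a minimal simulation of that 
scenario. It is only a sketch: CachingScheme is a hypothetical stand-in for a 
naming scheme that lists the directory once and then counts, the directory is 
modelled as an in-memory map, and each collection is assumed to produce a 
single output file. None of this is the actual Crunch code or the patch.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in: lists the "directory" once, then increments a counter. */
class CachingScheme {
    private int next = -1; // -1 means the directory has not been listed yet

    String nextName(Map<String, String> dir) {
        if (next < 0) {
            next = dir.size(); // the one and only listStatus()-style read
        }
        return "part-" + next++;
    }
}

class AppendCollisionSketch {
    /** Returns the simulated directory: file name -> collection that wrote it. */
    static Map<String, String> simulate() {
        Map<String, String> dir = new TreeMap<>();
        CachingScheme schemeA = new CachingScheme(); // belongs to targetA
        CachingScheme schemeB = new CachingScheme(); // belongs to targetB

        // First run(): targetA is handled first, then targetB.
        dir.put(schemeA.nextName(dir), "collectionA");
        dir.put(schemeB.nextName(dir), "collectionB");

        // Second run(): schemeA still holds its old counter and reuses
        // the name that schemeB already handed out.
        dir.put(schemeA.nextName(dir), "collectionC");
        return dir;
    }

    public static void main(String[] args) {
        System.out.println(simulate());
    }
}
```

Under those assumptions this prints {part-0=collectionA, part-1=collectionC}, 
i.e. the file written for collectionB has been replaced.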

I think that's a pretty big stretch of a use case though, and I'm wondering if 
other things would blow up with that use case before the FileNamingScheme even 
became an issue.

It looks like we could have some kind of cache invalidation hook on a 
FileNamingScheme that is called in FileTargetImpl#handleOutputs, although that 
feels a bit wrong to me.
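
Roughly what I have in mind for such a hook is sketched below. All of the 
names are made up for illustration: InvalidatableNamingScheme is not the real 
FileNamingScheme interface, the handleOutputs method here only models the 
move loop in FileTargetImpl, and the directory is an in-memory set. The 
directory is still listed only once per handleOutputs call, so the speedup 
from the patch would be kept.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical interface, not the real Crunch FileNamingScheme. */
interface InvalidatableNamingScheme {
    String nextName(Set<String> existingFiles);

    /** Forget the cached count so the next call re-reads the directory. */
    void invalidate();
}

class CachedSequentialScheme implements InvalidatableNamingScheme {
    private int next = -1; // -1 means the count must be (re)read

    @Override
    public String nextName(Set<String> existingFiles) {
        if (next < 0) {
            next = existingFiles.size(); // single listing per handleOutputs call
        }
        return "part-" + next++;
    }

    @Override
    public void invalidate() {
        next = -1;
    }
}

class InvalidationHookSketch {
    /** Models the move loop: invalidate once, then name every file. */
    static void handleOutputs(InvalidatableNamingScheme scheme, Set<String> dir, int files) {
        scheme.invalidate();
        for (int i = 0; i < files; i++) {
            dir.add(scheme.nextName(dir));
        }
    }

    static Set<String> simulate() {
        Set<String> dir = new TreeSet<>();
        InvalidatableNamingScheme schemeA = new CachedSequentialScheme();
        InvalidatableNamingScheme schemeB = new CachedSequentialScheme();
        handleOutputs(schemeA, dir, 2); // first run, targetA
        handleOutputs(schemeB, dir, 2); // first run, targetB
        handleOutputs(schemeA, dir, 2); // second run, targetA
        return dir;
    }

    public static void main(String[] args) {
        System.out.println(simulate());
    }
}
```

With the invalidation in place, the second run for targetA starts counting 
after the files from targetB, so all six simulated files end up with distinct 
names.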
                
> SequentialFileNamingScheme should cache the # of files in the target 
> directory after the first read
> ---------------------------------------------------------------------------------------------------
>
>                 Key: CRUNCH-256
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-256
>             Project: Crunch
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>             Fix For: 0.8.0
>
>         Attachments: CRUNCH-256.patch
>
>
> After a job finishes running, the post-job hooks rename the files from a temp 
> output directory to the target output directory. When we have lots of files, 
> this move can take a long time, and I traced the performance issue to the 
> fact that SequentialFileNamingScheme does a listStatus() on the output 
> directory for every file that gets moved. If SequentialFileNamingScheme just 
> does this check once and then increments an internal counter, we can 
> significantly decrease the performance overhead involved with the move.

