[ https://issues.apache.org/jira/browse/CRUNCH-658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16772200#comment-16772200 ]

Andrew Olson commented on CRUNCH-658:
-------------------------------------

Looks like skipping the getLastModifiedAt for sources & targets is another potential optimization? Especially since it's only actually used (as far as I can tell) if the WriteMode is CHECKPOINT. The implementations of getPathSize and getLastModifiedAt in SourceTargetHelper are similar so I would expect them to have the same performance issue with object stores.

> Add a way to skip the getSize checks for Sources from object stores
> -------------------------------------------------------------------
>
>                 Key: CRUNCH-658
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-658
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.14.0
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>            Priority: Major
>
> Ran into a problem when using Crunch to process a _lot_ of data from S3: the
> getSize checks can be very slow to run and don't materially add much to the
> overall processing of a pipeline when things like reducer counts are manually
> specified. I'd like to add a way to disable the file size checks, either
> globally or for specific input sources.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)