[ 
https://issues.apache.org/jira/browse/CRUNCH-660?focusedWorklogId=200836&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-200836
 ]

ASF GitHub Bot logged work on CRUNCH-660:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 19/Feb/19 19:42
            Start Date: 19/Feb/19 19:42
    Worklog Time Spent: 10m 
      Work Description: noslowerdna commented on issue #17: CRUNCH-660, 
CRUNCH-675: Use DistCp instead of FileUtils.copy when sou…
URL: https://github.com/apache/crunch/pull/17#issuecomment-465280884
 
 
   This was committed here: 
https://github.com/apache/crunch/commit/07458f78282e1b55aee90960818f5fcb35dae5c0
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

            Worklog Id:     (was: 200836)
            Time Spent: 10m
    Remaining Estimate: 0h

> FileTargetImpl uses Distcp vs FileUtils.copy
> --------------------------------------------
>
>                 Key: CRUNCH-660
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-660
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Micah Whitacre
>            Assignee: Josh Wills
>            Priority: Major
>             Fix For: 1.0.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> I'm not sure this can be solved in a way that works across multiple 
> runtimes, but I'm documenting it as a JIRA regardless.
> If you are running in a multi-cluster environment where you read data from 
> one cluster and write the output to another cluster (e.g. generating HFiles 
> to be loaded into a separate HBase cluster), moving the files is noticeably 
> slow, apparently because the move happens in the launcher/driver process 
> rather than as part of node execution.[1]
> A more efficient option would be to kick off a DistCp instead, but that would 
> tie the target directly to a runtime, which is not a great approach.
> [1] - 
> https://github.com/apache/crunch/blob/5609b014378d3460a55ce25522f0c00659872807/crunch-core/src/main/java/org/apache/crunch/io/impl/FileTargetImpl.java#L157
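
The trade-off described above can be sketched as a small decision function. This is only an illustration under stated assumptions, not the code from the linked commit: the class name CopyStrategySketch, the Strategy enum, and the distributedCopyAvailable flag are hypothetical, and it assumes that two paths are on the same cluster when their URI scheme and authority match. It uses only java.net.URI and does not call any Hadoop or Crunch API.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Objects;

// Hypothetical sketch: choose how to move job output to its final location.
public class CopyStrategySketch {

    // Hypothetical strategy names; none of these are Crunch or Hadoop types.
    public enum Strategy { RENAME_IN_PLACE, DRIVER_SIDE_COPY, DISTRIBUTED_COPY }

    // Assumption: same scheme and authority (host:port) means same cluster.
    public static boolean sameFileSystem(URI src, URI dst) {
        return Objects.equals(lower(src.getScheme()), lower(dst.getScheme()))
            && Objects.equals(lower(src.getAuthority()), lower(dst.getAuthority()));
    }

    private static String lower(String s) {
        return s == null ? null : s.toLowerCase(Locale.ROOT);
    }

    // distributedCopyAvailable models whether the current runtime can launch
    // a distributed copy job (for example DistCp on a MapReduce runtime).
    public static Strategy choose(URI src, URI dst, boolean distributedCopyAvailable) {
        if (sameFileSystem(src, dst)) {
            // Same cluster: a move within one file system needs no data copy.
            return Strategy.RENAME_IN_PLACE;
        }
        // Different clusters: avoid streaming every byte through the driver
        // when the runtime can run the copy on the cluster nodes instead.
        return distributedCopyAvailable
            ? Strategy.DISTRIBUTED_COPY
            : Strategy.DRIVER_SIDE_COPY;
    }

    public static void main(String[] args) {
        URI src = URI.create("hdfs://clusterA:8020/tmp/crunch/output");
        URI dst = URI.create("hdfs://clusterB:8020/hbase/staging");
        System.out.println(choose(src, dst, true));
        System.out.println(choose(src, dst, false));
        System.out.println(choose(src, src, true));
    }
}
```

Keeping the runtime capability as an explicit input is one way to express the concern raised in the description: the target itself does not have to be tied to a particular runtime, and a driver-side copy remains the fallback.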



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
