[ 
https://issues.apache.org/jira/browse/CRUNCH-677?focusedWorklogId=202056&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-202056
 ]

ASF GitHub Bot logged work on CRUNCH-677:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 21/Feb/19 16:36
            Start Date: 21/Feb/19 16:36
    Worklog Time Spent: 10m 
      Work Description: ben-roling commented on pull request #19: CRUNCH-677 Source and Target accept FileSystem
URL: https://github.com/apache/crunch/pull/19#discussion_r259011894
 
 

 ##########
 File path: 
crunch-core/src/main/java/org/apache/crunch/io/impl/FileTargetImpl.java
 ##########
 @@ -178,7 +206,7 @@ public void handleOutputs(Configuration conf, Path workingPath, int index) throw
       if (useDistributedCopy) {
         LOG.info("Source and destination are in different file systems, performing distributed copy from {} to {}", srcPattern,
             path);
-        handeOutputsDistributedCopy(conf, srcPattern, srcFs, dstFs, maxDistributedCopyTasks);
+        handleOutputsDistributedCopy(dstFsConf, srcPattern, srcFs, dstFs, maxDistributedCopyTasks);
 
 Review comment:
   This is a cherry-pick merge mistake that is causing the build to fail. I'll 
fix it shortly and make sure the build and all tests pass. This change was 
originally developed on an internal fork and reviewed with my colleagues, 
@noslowerdna and @mkwhitacre.
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 202056)
    Time Spent: 20m  (was: 10m)

> Support passing FileSystem to File Sources and Targets
> ------------------------------------------------------
>
>                 Key: CRUNCH-677
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-677
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Ben Roling
>            Assignee: Josh Wills
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> We'd like to pass a FileSystem instance to File Sources and Targets to fully 
> qualify the Path.  Without the FileSystem, the Pipeline doesn't necessarily 
> have enough information to understand the Path.  In particular, when the Path 
> is an HA HDFS path like "hdfs://my-cluster/data", the Pipeline might not have 
> the 
> [configuration|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html#Configuration_details]
>  to resolve "hdfs://my-cluster".
> Although it is sometimes possible to seed the Pipeline configuration with 
> all the HDFS properties needed to communicate with every HDFS HA cluster the 
> Pipeline might talk to, doing so is awkward or difficult in general.  In some 
> of our cases we would rather not have to know all of the clusters up front.
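
To make the "HDFS properties necessary" for one HA cluster concrete, here is a minimal sketch that assembles the client-side keys described in the linked Hadoop HA documentation for a single logical nameservice. The class name, method name, NameNode IDs and host names are all hypothetical, and the result is a plain map rather than a Hadoop Configuration so the example stays dependency-free.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HaClientProperties {

    /**
     * Builds the HDFS HA client properties for one logical nameservice.
     * Keys follow the Hadoop HA documentation; the helper itself is
     * illustrative and not part of Crunch or Hadoop.
     */
    public static Map<String, String> forNameservice(
            String nameservice, Map<String, String> rpcAddressByNameNodeId) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("dfs.nameservices", nameservice);
        // Comma-separated list of NameNode IDs within the nameservice.
        props.put("dfs.ha.namenodes." + nameservice,
                String.join(",", rpcAddressByNameNodeId.keySet()));
        // One RPC address per NameNode ID.
        for (Map.Entry<String, String> e : rpcAddressByNameNodeId.entrySet()) {
            props.put("dfs.namenode.rpc-address." + nameservice + "." + e.getKey(),
                    e.getValue());
        }
        // Client-side class that picks the active NameNode.
        props.put("dfs.client.failover.proxy.provider." + nameservice,
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        return props;
    }

    public static void main(String[] args) {
        Map<String, String> nameNodes = new LinkedHashMap<>();
        nameNodes.put("nn1", "namenode1.example.com:8020"); // hypothetical host
        nameNodes.put("nn2", "namenode2.example.com:8020"); // hypothetical host
        forNameservice("my-cluster", nameNodes)
                .forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

In a real job each entry would presumably be copied into a Hadoop Configuration with Configuration.set(key, value), and the FileSystem obtained with FileSystem.get(URI, Configuration). Repeating this for every cluster a Pipeline might touch is the burden the description refers to.
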
> With the proposed change, code like the following is possible, where 
> {{readFileSystem}} and {{writeFileSystem}} are external FileSystems 
> synthesized from Configuration completely separate from that used to 
> construct the Pipeline itself:
> {code}
> Configuration emptyConfiguration = new Configuration(false);
> Pipeline pipeline = new MRPipeline(getClass(), emptyConfiguration);
> FileSystem readFileSystem = ...;
> PCollection<String> data = pipeline.read(From.textFile("hdfs://my-cluster-1/data", readFileSystem));
> FileSystem writeFileSystem = ...;
> pipeline.write(data, To.textFile("hdfs://my-cluster-2/output", writeFileSystem));
> {code}
> Note: the hdfs://my-cluster-1 and hdfs://my-cluster-2 parts of the paths 
> would not strictly need to be included as they would be implied by the 
> FileSystem instances passed in the calls.  As such the paths could simply be 
> passed as "/data" and "/output" with equivalent behavior.
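
The equivalence described in the note can be illustrated with plain URI resolution. This is only a sketch that uses java.net.URI as a stand-in. Hadoop qualifies paths through its own Path and FileSystem classes, whose behavior may differ in detail, and the class and method names below are hypothetical.

```java
import java.net.URI;

public class PathQualification {

    /**
     * Resolves a path string against a file system's base URI.
     * A path that already carries scheme and authority is returned as-is;
     * a bare absolute path inherits them from the base URI.
     */
    public static URI qualify(URI fileSystemUri, String path) {
        return fileSystemUri.resolve(path);
    }

    public static void main(String[] args) {
        URI cluster = URI.create("hdfs://my-cluster-1");
        URI fromBarePath = qualify(cluster, "/data");
        URI fromFullPath = qualify(cluster, "hdfs://my-cluster-1/data");
        System.out.println(fromBarePath);
        System.out.println(fromFullPath);
        System.out.println(fromBarePath.equals(fromFullPath));
    }
}
```

Both calls yield the same fully qualified URI, which is why the scheme and authority can be omitted once the FileSystem supplies them.
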



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
