[ https://issues.apache.org/jira/browse/CRUNCH-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769935#comment-16769935 ]
Josh Wills commented on CRUNCH-677:
-----------------------------------
That would be great, thank you!
> Support passing FileSystem to File Sources and Targets
> ------------------------------------------------------
>
> Key: CRUNCH-677
> URL: https://issues.apache.org/jira/browse/CRUNCH-677
> Project: Crunch
> Issue Type: Improvement
> Components: Core
> Reporter: Ben Roling
> Assignee: Josh Wills
> Priority: Major
>
> We'd like to pass a FileSystem instance to File Sources and Targets to fully
> qualify the Path. Without the FileSystem, the Pipeline doesn't necessarily
> have enough information to understand the Path. In particular, when the Path
> is an HA HDFS path like "hdfs://my-cluster/data", the Pipeline might not have
> the [configuration|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html#Configuration_details]
> to resolve "hdfs://my-cluster".
> While it is in some cases possible to seed the Pipeline configuration with
> all the HDFS properties necessary to communicate with any HDFS HA cluster the
> Pipeline might talk to, it can be awkward and/or difficult to do this in all
> cases. We have cases where we'd like not to have to know all of the clusters
> upfront.
> With the proposed change, code like the following is possible, where
> {{readFileSystem}} and {{writeFileSystem}} are external FileSystems
> synthesized from Configuration completely separate from that used to
> construct the Pipeline itself:
> {code}
> Configuration emptyConfiguration = new Configuration(false);
> Pipeline pipeline = new MRPipeline(getClass(), emptyConfiguration);
> FileSystem readFileSystem = ...;
> PCollection<String> data =
>     pipeline.read(From.textFile("hdfs://my-cluster-1/data", readFileSystem));
> FileSystem writeFileSystem = ...;
> pipeline.write(data,
>     To.textFile("hdfs://my-cluster-2/output", writeFileSystem));
> {code}
> Note: the hdfs://my-cluster-1 and hdfs://my-cluster-2 parts of the paths
> would not strictly need to be included as they would be implied by the
> FileSystem instances passed in the calls. As such the paths could simply be
> passed as "/data" and "/output" with equivalent behavior.
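
To illustrate the note above about "/data" and "hdfs://my-cluster-1/data" being equivalent once a FileSystem is supplied, here is a minimal sketch of the qualification idea using only java.net.URI. It is not Crunch's or Hadoop's implementation: the class PathQualifier and its qualify method are hypothetical names, and the FileSystem is represented only by its URI. The rule it assumes is simple: a path that already carries a scheme and authority is kept as-is, and a bare absolute path borrows both from the FileSystem's URI.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical helper; a simplified model of qualifying a path against a FileSystem URI. */
public final class PathQualifier {

    private PathQualifier() {}

    /**
     * Returns the path unchanged if it already has a scheme and authority;
     * otherwise combines the FileSystem's scheme and authority with the path.
     */
    public static URI qualify(URI fileSystemUri, String path) {
        URI parsed = URI.create(path);
        if (parsed.getScheme() != null && parsed.getAuthority() != null) {
            return parsed; // already fully qualified, e.g. hdfs://my-cluster-1/data
        }
        String rawPath = parsed.getPath();
        if (rawPath == null || !rawPath.startsWith("/")) {
            // Relative paths would need a working directory, which this sketch omits.
            throw new IllegalArgumentException("Expected an absolute path: " + path);
        }
        try {
            return new URI(fileSystemUri.getScheme(), fileSystemUri.getAuthority(),
                    rawPath, null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        URI fs = URI.create("hdfs://my-cluster-1");
        System.out.println(qualify(fs, "/data"));
        System.out.println(qualify(fs, "hdfs://my-cluster-1/data"));
    }
}
```

Under these assumptions both calls in main resolve to the same fully qualified location, which is why the cluster prefix can be dropped when the FileSystem is passed explicitly.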