[ 
https://issues.apache.org/jira/browse/CRUNCH-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769895#comment-16769895
 ] 

Josh Wills commented on CRUNCH-677:
-----------------------------------

I think the trick we usually use for this sort of thing is something akin 
to the optional `conf` settings we allow on (essentially all) Source and Target 
instances, so we could add some signatures like:

`Source<T> filesystem(FileSystem fs)`

and

`Target filesystem(FileSystem fs)`

etc. Would that do the trick?
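
To make the shape concrete, here is a rough, self-contained sketch of how such a fluent override might behave. It is only an illustration: `StubFileSystem` and `TextSource` are hypothetical stand-ins (not Crunch or Hadoop classes), and resolving paths with `java.net.URI` is an assumption made so the example runs without Hadoop on the classpath.

```java
import java.net.URI;
import java.util.Objects;

public class FileSystemOverrideSketch {

    /** Hypothetical stand-in for a Hadoop FileSystem; it only carries the URI it is bound to. */
    static final class StubFileSystem {
        private final URI uri;

        StubFileSystem(String uri) {
            this.uri = URI.create(uri);
        }

        URI getUri() {
            return uri;
        }
    }

    /** Hypothetical file-backed source with an optional FileSystem override. */
    static final class TextSource {
        private final String path;
        private StubFileSystem fs; // null means "fall back to the pipeline's default"

        TextSource(String path) {
            this.path = Objects.requireNonNull(path);
        }

        /** Fluent setter in the spirit of the proposed filesystem(FileSystem fs) signature. */
        TextSource filesystem(StubFileSystem fs) {
            this.fs = Objects.requireNonNull(fs);
            return this; // returning this keeps the call chainable, like the conf settings
        }

        /** Qualifies the path against the override if present, else against the given default. */
        URI qualifiedPath(URI pipelineDefault) {
            URI base = (fs != null) ? fs.getUri() : pipelineDefault;
            return base.resolve(path);
        }
    }

    public static void main(String[] args) {
        URI pipelineDefault = URI.create("hdfs://default-cluster/");

        TextSource overridden = new TextSource("/data")
            .filesystem(new StubFileSystem("hdfs://my-cluster-1/"));
        TextSource plain = new TextSource("/data");

        System.out.println(overridden.qualifiedPath(pipelineDefault));
        System.out.println(plain.qualifiedPath(pipelineDefault));
    }
}
```

In this sketch the override is optional, so existing callers that never call `filesystem(...)` keep resolving paths against whatever the pipeline already knows about.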

> Support passing FileSystem to File Sources and Targets
> ------------------------------------------------------
>
>                 Key: CRUNCH-677
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-677
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Ben Roling
>            Assignee: Josh Wills
>            Priority: Major
>
> We'd like to pass a FileSystem instance to File Sources and Targets to fully 
> qualify the Path.  Without the FileSystem, the Pipeline doesn't necessarily 
> have enough information to understand the Path.  In particular, when the Path 
> is an HA HDFS path like "hdfs://my-cluster/data", the Pipeline might not have 
> the 
> [configuration|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html#Configuration_details]
>  to resolve "hdfs://my-cluster".
> While it is in some cases possible to seed the Pipeline configuration with 
> all the HDFS properties necessary to communicate with any HDFS HA cluster the 
> Pipeline might talk to, it can be awkward and/or difficult to do this in all 
> cases.  We have cases where we'd like not to have to know all of the clusters 
> upfront.
> With the proposed change, code like the following is possible, where 
> {{readFileSystem}} and {{writeFileSystem}} are external FileSystems 
> synthesized from Configuration completely separate from that used to 
> construct the Pipeline itself:
> {code}
> Configuration emptyConfiguration = new Configuration(false);
> Pipeline pipeline = new MRPipeline(getClass(), emptyConfiguration);
> FileSystem readFileSystem = ...;
> PCollection<String> data =
>     pipeline.read(From.textFile("hdfs://my-cluster-1/data", readFileSystem));
> FileSystem writeFileSystem = ...;
> pipeline.write(data,
>     To.textFile("hdfs://my-cluster-2/output", writeFileSystem));
> {code}
> Note: the hdfs://my-cluster-1 and hdfs://my-cluster-2 parts of the paths 
> would not strictly need to be included as they would be implied by the 
> FileSystem instances passed in the calls.  As such the paths could simply be 
> passed as "/data" and "/output" with equivalent behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
