[ https://issues.apache.org/jira/browse/CRUNCH-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769895#comment-16769895 ]

Josh Wills commented on CRUNCH-677:
-----------------------------------

I think the trick we usually did for this sort of thing would be something akin to the optional `conf` settings we allow on (essentially all) Source and Target instances; so we could do some signatures like: `Source<T> filesystem(FileSystem fs)` and `Target filesystem(FileSystem fs)` etc. Would that do the trick?

> Support passing FileSystem to File Sources and Targets
> ------------------------------------------------------
>
>                 Key: CRUNCH-677
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-677
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Ben Roling
>            Assignee: Josh Wills
>            Priority: Major
>
> We'd like to pass a FileSystem instance to File Sources and Targets to fully
> qualify the Path. Without the FileSystem, the Pipeline doesn't necessarily
> have enough information to understand the Path. In particular, when the Path
> is an HA HDFS path like "hdfs://my-cluster/data", the Pipeline might not have
> the
> [configuration|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html#Configuration_details]
> to resolve "hdfs://my-cluster".
> While it is in some cases possible to seed the Pipeline configuration with
> all the HDFS properties necessary to communicate with any HDFS HA cluster the
> Pipeline might talk to, it can be awkward and/or difficult to do this in all
> cases. We have cases where we'd like not to have to know all of the clusters
> upfront.
> With the proposed change, code like the following is possible, where
> {{readFileSystem}} and {{writeFileSystem}} are external FileSystems
> synthesized from Configuration completely separate from that used to
> construct the Pipeline itself:
> {code}
> Configuration emptyConfiguration = new Configuration(false);
> Pipeline pipeline = new MRPipeline(getClass(), emptyConfiguration);
> FileSystem readFileSystem = ...;
> PCollection<String> data =
>     pipeline.read(From.textFile("hdfs://my-cluster-1/data", readFileSystem));
> FileSystem writeFileSystem = ...;
> pipeline.write(data, To.textFile("hdfs://my-cluster-2/output",
>     writeFileSystem));
> {code}
> Note: the hdfs://my-cluster-1 and hdfs://my-cluster-2 parts of the paths
> would not strictly need to be included as they would be implied by the
> FileSystem instances passed in the calls. As such the paths could simply be
> passed as "/data" and "/output" with equivalent behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
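
The note in the issue description says that a scheme-less path such as "/data" behaves the same as the fully qualified form once a FileSystem is supplied. A minimal sketch of that qualification rule follows, using only java.net.URI as a stand-in for Hadoop's Path and FileSystem. The class name QualifyPathSketch and the method qualify are hypothetical and are not part of Crunch or Hadoop; the real change would presumably rely on the FileSystem instance itself to qualify the Path.

```java
import java.net.URI;

public class QualifyPathSketch {

    // Hypothetical helper: fileSystemUri plays the role of the URI a
    // FileSystem instance is bound to, e.g. "hdfs://my-cluster-1".
    static URI qualify(URI fileSystemUri, String path) {
        URI candidate = URI.create(path);
        if (candidate.getScheme() != null) {
            // Already fully qualified: keep the path exactly as given.
            return candidate;
        }
        // No scheme: take scheme and authority from the file system URI.
        return fileSystemUri.resolve(candidate);
    }

    public static void main(String[] args) {
        URI readFs = URI.create("hdfs://my-cluster-1");
        URI writeFs = URI.create("hdfs://my-cluster-2");

        // Both spellings of the read path end up identical.
        System.out.println(qualify(readFs, "/data"));                    // hdfs://my-cluster-1/data
        System.out.println(qualify(readFs, "hdfs://my-cluster-1/data")); // hdfs://my-cluster-1/data

        // The same holds for the write path.
        System.out.println(qualify(writeFs, "/output"));                 // hdfs://my-cluster-2/output
    }
}
```

Under this sketch the pipeline configuration never needs to know either cluster: everything required to resolve the path comes from the URI associated with the supplied file system.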