[jira] [Commented] (CRUNCH-677) Support passing FileSystem to File Sources and Targets

Ben Roling (JIRA) Fri, 15 Feb 2019 07:41:11 -0800


    [ 
https://issues.apache.org/jira/browse/CRUNCH-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769427#comment-16769427
 ]


Ben Roling commented on CRUNCH-677:
-----------------------------------

I should note that the example is a little extreme in seeding the Pipeline 
with an *empty* Configuration.  That is just to emphasize that the Pipeline's 
Configuration need not cover every FileSystem the Pipeline might need to talk 
to.  In practice the Pipeline would need to start with more configuration: 
enough to communicate with MapReduce and with a default FileSystem where it 
can store intermediate output (i.e. the Crunch tmp dir).

> Support passing FileSystem to File Sources and Targets
> ------------------------------------------------------
>
>                 Key: CRUNCH-677
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-677
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Ben Roling
>            Assignee: Josh Wills
>            Priority: Major
>
> We'd like to pass a FileSystem instance to File Sources and Targets to fully 
> qualify the Path.  Without the FileSystem, the Pipeline doesn't necessarily 
> have enough information to understand the Path.  In particular, when the Path 
> is an HA HDFS path like "hdfs://my-cluster/data", the Pipeline might not have 
> the 
> [configuration|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html#Configuration_details]
>  to resolve "hdfs://my-cluster".
> While it is sometimes possible to seed the Pipeline configuration with all 
> the HDFS properties needed to communicate with every HDFS HA cluster the 
> Pipeline might talk to, doing so can be awkward or difficult.  We have cases 
> where we'd rather not have to know all of the clusters upfront.
> With the proposed change, code like the following is possible, where 
> {{readFileSystem}} and {{writeFileSystem}} are external FileSystems 
> synthesized from Configuration completely separate from that used to 
> construct the Pipeline itself:
> {code}
> Configuration emptyConfiguration = new Configuration(false);
> Pipeline pipeline = new MRPipeline(getClass(), emptyConfiguration);
> FileSystem readFileSystem = ...;
> PCollection<String> data =
>     pipeline.read(From.textFile("hdfs://my-cluster-1/data", readFileSystem));
> FileSystem writeFileSystem = ...;
> pipeline.write(data,
>     To.textFile("hdfs://my-cluster-2/output", writeFileSystem));
> {code}
> Note: the hdfs://my-cluster-1 and hdfs://my-cluster-2 parts of the paths 
> would not strictly need to be included as they would be implied by the 
> FileSystem instances passed in the calls.  As such the paths could simply be 
> passed as "/data" and "/output" with equivalent behavior.
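
The note above about "/data" and "hdfs://my-cluster-1/data" being equivalent 
can be illustrated with plain URI resolution.  This is only a rough sketch 
using java.net.URI, not Hadoop's or Crunch's actual path qualification; the 
class name PathQualification and the helper qualify are hypothetical and 
exist only for this illustration.

```java
import java.net.URI;

public class PathQualification {

    // Hypothetical helper: resolves a path against the base URI of a file
    // system. A scheme-less path picks up the scheme and authority of the
    // file system; an already qualified path is returned unchanged.
    static String qualify(String fileSystemUri, String path) {
        return URI.create(fileSystemUri).resolve(URI.create(path)).toString();
    }

    public static void main(String[] args) {
        // Both calls print hdfs://my-cluster-1/data
        System.out.println(qualify("hdfs://my-cluster-1", "/data"));
        System.out.println(qualify("hdfs://my-cluster-1", "hdfs://my-cluster-1/data"));
    }
}
```

In this sketch the file system's URI plays the role of the FileSystem 
instance passed to the source or target: it supplies the scheme and authority 
that the bare path lacks.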



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
