Ben Roling created CRUNCH-677:
---------------------------------

             Summary: Support passing FileSystem to File Sources and Targets
                 Key: CRUNCH-677
                 URL: https://issues.apache.org/jira/browse/CRUNCH-677
             Project: Crunch
          Issue Type: Improvement
          Components: Core
            Reporter: Ben Roling
            Assignee: Josh Wills


We'd like to pass a FileSystem instance to File Sources and Targets to fully 
qualify the Path.  Without the FileSystem, the Pipeline doesn't necessarily 
have enough information to understand the Path.  In particular, when the Path 
is an HA HDFS path like "hdfs://my-cluster/data", the Pipeline itself may not 
have the necessary 
[configuration|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html#Configuration_details]
 to resolve "hdfs://my-cluster".
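
To make the missing configuration concrete, here is a minimal sketch of the client-side properties an HDFS client typically needs before it can resolve a logical HA nameservice such as "my-cluster". The property keys are the standard ones from the HDFS HA documentation; the class name, the NameNode IDs "nn1"/"nn2", and the example.com hosts are hypothetical placeholders, and a real cluster may need further settings.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class HaClientProperties {

    /**
     * Builds the client-side HA properties for one logical nameservice
     * backed by two NameNodes. The NameNode IDs "nn1" and "nn2" are
     * arbitrary labels chosen for this sketch.
     */
    static Map<String, String> forNameservice(
            String nameservice, String nn1Address, String nn2Address) {
        Map<String, String> props = new LinkedHashMap<>();
        // Declare the logical nameservice and the NameNodes behind it.
        props.put("dfs.nameservices", nameservice);
        props.put("dfs.ha.namenodes." + nameservice, "nn1,nn2");
        // RPC endpoint of each NameNode.
        props.put("dfs.namenode.rpc-address." + nameservice + ".nn1", nn1Address);
        props.put("dfs.namenode.rpc-address." + nameservice + ".nn2", nn2Address);
        // Class the client uses to locate the active NameNode.
        props.put("dfs.client.failover.proxy.provider." + nameservice,
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical hosts; substitute the real NameNode addresses.
        Map<String, String> props = forNameservice(
                "my-cluster", "namenode1.example.com:8020", "namenode2.example.com:8020");
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Each entry could then be copied into a Hadoop Configuration with {{conf.set(key, value)}} before calling {{FileSystem.get(URI.create("hdfs://my-cluster"), conf)}}. Without such entries the client has no way to map "my-cluster" to actual NameNode hosts.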

Although it is sometimes possible to seed the Pipeline configuration with the 
HDFS properties for every HA cluster the Pipeline might talk to, doing so can 
be awkward or difficult in general.  We have cases where we would rather not 
have to know all of the clusters up front.

With the proposed change, code like the following becomes possible, where 
{{readFileSystem}} and {{writeFileSystem}} are external FileSystems created 
from Configuration objects completely separate from the one used to construct 
the Pipeline itself:

{code}
Configuration emptyConfiguration = new Configuration(false);
Pipeline pipeline = new MRPipeline(getClass(), emptyConfiguration);

FileSystem readFileSystem = ...;
PCollection<String> data =
    pipeline.read(From.textFile("hdfs://my-cluster-1/data", readFileSystem));

FileSystem writeFileSystem = ...;
pipeline.write(data,
    To.textFile("hdfs://my-cluster-2/output", writeFileSystem));
{code}

Note: the "hdfs://my-cluster-1" and "hdfs://my-cluster-2" prefixes would not 
strictly need to be included, as they are implied by the FileSystem instances 
passed in the calls.  The paths could therefore be passed simply as "/data" 
and "/output" with equivalent behavior.

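The equivalence described in the note can be illustrated with plain {{java.net.URI}} resolution. This is only an analogy using the JDK, not the Crunch or Hadoop implementation; the {{PathQualifier}} class and its {{qualify}} method are hypothetical names for this sketch. It assumes the supplied FileSystem contributes the scheme and authority whenever the path itself lacks them.

```java
import java.net.URI;

final class PathQualifier {

    /**
     * Qualifies a path against a file system URI. A path that already
     * carries a scheme is returned unchanged; a bare absolute path
     * inherits the scheme and authority of the file system URI.
     */
    static URI qualify(URI fileSystemUri, String path) {
        return fileSystemUri.resolve(path);
    }

    public static void main(String[] args) {
        URI readFs = URI.create("hdfs://my-cluster-1");
        // Bare path: scheme and authority come from the file system URI.
        System.out.println(qualify(readFs, "/data"));
        // Fully qualified path: already complete, so it is kept as-is.
        System.out.println(qualify(readFs, "hdfs://my-cluster-1/data"));
    }
}
```

Both calls yield {{hdfs://my-cluster-1/data}}, which matches the behavior the note describes for "/data" versus the fully qualified form.
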


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
