[ 
https://issues.apache.org/jira/browse/BEAM-7613?focusedWorklogId=264742&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-264742
 ]

ASF GitHub Bot logged work on BEAM-7613:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 21/Jun/19 16:21
            Start Date: 21/Jun/19 16:21
    Worklog Time Spent: 10m 
      Work Description: dmvk commented on pull request #8923: [BEAM-7613] 
HadoopFileSystem can work with more than one cluster.
URL: https://github.com/apache/beam/pull/8923#discussion_r296304905
 
 

 ##########
 File path: 
sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystem.java
 ##########
 @@ -313,7 +328,7 @@ protected HadoopResourceId matchNewResource(String 
singleResourceSpec, boolean i
 
   @Override
   protected String getScheme() {
-    return fileSystem.getScheme();
+    return "hdfs";
 
 Review comment:
   Got the idea. Anyway, I don't think `hdfs\d+` is a valid scheme that 
Hadoop understands.
   
   If we want to support multiple clusters using the native protocol, what 
matters is the authority section of the URL. I think we can introduce some kind 
of `AuthorityAwareFileSystem` that would be just a container for the 
HadoopFileSystems with the same scheme and would decide, based on the authority 
part, which filesystem should be used.
   
   There are some things I'm not sure about:
   1) What should we do in case there are multiple configs for the same scheme + 
authority? I think failing fast would be the best option here.
   2) We would be duplicating functionality already provided by Hadoop. This is 
the equivalent of specifying multiple `nameservice` entries in the same config. 
Do we want to do this, given that it is a rather unexpected API for a user 
coming from the Hadoop world?
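
A minimal sketch of the authority-based dispatch described above, including the fail-fast behaviour from point 1. This is only an illustration, not the PR's implementation: the class name `AuthorityAwareRegistry` and its methods are hypothetical, and the type parameter `T` stands in for `HadoopFileSystem` so that the sketch compiles without Hadoop on the classpath.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical container that selects a file system by the authority part of a URI.
 * T is a stand-in for HadoopFileSystem (assumption made to keep the sketch self-contained).
 */
class AuthorityAwareRegistry<T> {
  private final String scheme;
  private final Map<String, T> byAuthority = new HashMap<>();

  AuthorityAwareRegistry(String scheme) {
    this.scheme = scheme;
  }

  /** Registers a file system for an authority; fails fast if one is already registered. */
  void register(String authority, T fileSystem) {
    if (byAuthority.putIfAbsent(authority, fileSystem) != null) {
      throw new IllegalArgumentException(
          "Multiple configurations for " + scheme + "://" + authority);
    }
  }

  /** Returns the file system registered for the scheme + authority of the given URI. */
  T resolve(URI uri) {
    if (!scheme.equals(uri.getScheme())) {
      throw new IllegalArgumentException("Unexpected scheme: " + uri.getScheme());
    }
    T fileSystem = byAuthority.get(uri.getAuthority());
    if (fileSystem == null) {
      throw new IllegalArgumentException(
          "No file system registered for authority: " + uri.getAuthority());
    }
    return fileSystem;
  }
}
```

With two clusters registered under the `hdfs` scheme, `resolve(URI.create("hdfs://cluster-a:8020/tmp/x"))` would return whatever was registered for `cluster-a:8020`, while registering the same authority a second time throws immediately.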
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 264742)
    Time Spent: 40m  (was: 0.5h)

> HadoopFileSystem can be only used with fs.defaultFS
> ---------------------------------------------------
>
>                 Key: BEAM-7613
>                 URL: https://issues.apache.org/jira/browse/BEAM-7613
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-hadoop-file-system
>    Affects Versions: 2.13.0
>            Reporter: David Moravek
>            Assignee: David Moravek
>            Priority: Major
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> _HadoopFileSystem_ creates the underlying _FileSystem_ (the one from 
> org.apache.hadoop) instance during its construction. A single _FileSystem_ 
> instance is tied to a particular cluster (scheme + authority pair). If 
> we want to talk to another cluster, this fails due to _FileSystem#checkPath_.
>  
> This can be fixed by using _FileSystem#get(java.net.URI, 
> org.apache.hadoop.conf.Configuration)_ instead of 
> _FileSystem#newInstance(org.apache.hadoop.conf.Configuration)_.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
