Claudio Venturini created BEAM-9315:
---------------------------------------

             Summary: HadoopFileSystemOptions unable to interpret HADOOP_CONF_DIR with multiple paths
                 Key: BEAM-9315
                 URL: https://issues.apache.org/jira/browse/BEAM-9315
             Project: Beam
          Issue Type: Bug
          Components: io-java-hadoop-file-system
    Affects Versions: 2.19.0
         Environment: Cloudera CDH 6.3.2 with Spark 2.4.0 (Scala 2.11)
            Reporter: Claudio Venturini


In certain Hadoop deployments the {{HADOOP_CONF_DIR}} environment variable 
can contain multiple paths. For example, when running {{spark-submit}}, 
Cloudera 6.3 sets it as follows:

{{HADOOP_CONF_DIR=/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/spark/conf/yarn-conf:/etc/hive/conf}}

Currently the class {{HadoopFileSystemOptions}} reads the content of the 
variable but treats it as a single path. When the variable contains multiple 
paths, Beam cannot configure Hadoop properly, and so HDFS can't be accessed. 
The only workarounds I'm aware of are:
 - Override the {{HADOOP_CONF_DIR}} set by Cloudera for the Spark service. I 
think this could cause problems with other tools (for example when using Hive 
from Spark, because Spark might not be able to find the Hive config).
 - Pass the HDFS configuration using the {{--hdfsConfigurations}} option. This 
is inconvenient when there are many settings, and the values would not be 
updated automatically when the cluster is reconfigured in Cloudera Manager.

In my opinion, to fix this the {{HadoopFileSystemOptions}} class should split 
the content of the {{HADOOP_CONF_DIR}} environment variable on the colon (":") 
and treat each resulting path as a separate configuration directory.
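
To illustrate the proposed splitting, here is a minimal sketch. It is not the actual Beam code or the pending patch: the class {{HadoopConfDirs}} and its methods are hypothetical names used only for this example. It assumes that each directory may hold the usual {{core-site.xml}} and {{hdfs-site.xml}} files, and it only lists candidate file paths without loading them into a Hadoop configuration.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Beam.
public class HadoopConfDirs {

  // Splits a HADOOP_CONF_DIR-style value on ':' and drops empty entries.
  static List<String> split(String confDirValue) {
    List<String> dirs = new ArrayList<>();
    if (confDirValue == null) {
      return dirs; // variable not set: nothing to split
    }
    for (String dir : confDirValue.split(":")) {
      String trimmed = dir.trim();
      if (!trimmed.isEmpty()) {
        dirs.add(trimmed);
      }
    }
    return dirs;
  }

  // Lists the site files to look for in every directory of the value.
  // Assumption: core-site.xml and hdfs-site.xml are the files of interest.
  static List<Path> candidateSiteFiles(String confDirValue) {
    List<Path> files = new ArrayList<>();
    for (String dir : split(confDirValue)) {
      files.add(Paths.get(dir, "core-site.xml"));
      files.add(Paths.get(dir, "hdfs-site.xml"));
    }
    return files;
  }

  public static void main(String[] args) {
    // Prints one candidate path per line for the current environment.
    for (Path p : candidateSiteFiles(System.getenv("HADOOP_CONF_DIR"))) {
      System.out.println(p);
    }
  }
}
```

With the Cloudera value shown above, {{split}} would return two directories instead of one non-existent path. The sketch hard-codes ":" as in the proposal; using {{java.io.File.pathSeparator}} instead would be an alternative if the platform's separator should be honoured.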

I have already fixed this and all tests on class {{HadoopFileSystemOptions}} 
pass successfully. I'm preparing a pull request.

 


