Claudio Venturini created BEAM-9315:
---------------------------------------
Summary: HadoopFileSystemOptions unable to interpret
HADOOP_CONF_DIR with multiple paths
Key: BEAM-9315
URL: https://issues.apache.org/jira/browse/BEAM-9315
Project: Beam
Issue Type: Bug
Components: io-java-hadoop-file-system
Affects Versions: 2.19.0
Environment: Cloudera CDH 6.3.2 with Spark 2.4.0 (Scala 2.11)
Reporter: Claudio Venturini
In certain Hadoop deployments the {{HADOOP_CONF_DIR}} environment variable
can contain multiple paths. For example, when running {{spark-submit}},
Cloudera 6.3 sets it as follows:
{{HADOOP_CONF_DIR=/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/spark/conf/yarn-conf:/etc/hive/conf}}
Currently the class {{HadoopFileSystemOptions}} reads the content of the
variable but treats it as a single path. When it contains multiple paths, this
makes Beam unable to properly configure Hadoop, and so HDFS can't be accessed.
At the moment, the only workarounds that I'm aware of are:
- Override the {{HADOOP_CONF_DIR}} set by Cloudera for the Spark service. I
think this could cause problems with other tools (for example when using Hive
from Spark, because Spark might no longer be able to find the Hive config).
- Pass the HDFS configuration using the {{--hdfsConfiguration}} option. This
is inconvenient when there are many settings to pass, and the values would not
be updated automatically when they are reconfigured in Cloudera Manager.
In my opinion, to fix this the {{HadoopFileSystemOptions}} class should split
the content of the {{HADOOP_CONF_DIR}} environment variable on the colon (":")
so that every path it contains is detected.
I have already fixed this and all tests on class {{HadoopFileSystemOptions}}
pass successfully. I'm preparing a pull request.
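
The splitting proposed above can be sketched roughly as follows. This is only
an illustration and not the actual patch: the class {{HadoopConfDirs}} and its
two methods are hypothetical names that do not exist in Beam, and the sketch
assumes that the relevant files in each directory are {{core-site.xml}} and
{{hdfs-site.xml}}. It uses only the JDK, so it collects the candidate files
without loading them into a Hadoop configuration.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Beam.
final class HadoopConfDirs {

  // Splits a HADOOP_CONF_DIR-style value on ':' and drops empty entries.
  static List<String> splitConfDirs(String hadoopConfDir) {
    List<String> dirs = new ArrayList<>();
    if (hadoopConfDir == null) {
      return dirs;
    }
    for (String dir : hadoopConfDir.split(":")) {
      String trimmed = dir.trim();
      if (!trimmed.isEmpty()) {
        dirs.add(trimmed);
      }
    }
    return dirs;
  }

  // Returns the site files that actually exist in any of the directories.
  // Assumption: core-site.xml and hdfs-site.xml are the files of interest.
  static List<Path> findConfigFiles(String hadoopConfDir) {
    List<Path> found = new ArrayList<>();
    for (String dir : splitConfDirs(hadoopConfDir)) {
      for (String name : new String[] {"core-site.xml", "hdfs-site.xml"}) {
        Path candidate = Paths.get(dir, name);
        if (Files.isRegularFile(candidate)) {
          found.add(candidate);
        }
      }
    }
    return found;
  }
}
```

Calling {{HadoopConfDirs.findConfigFiles(System.getenv("HADOOP_CONF_DIR"))}}
would then give the files to add as configuration resources. The sketch splits
on a literal ":" as described in this issue; {{java.io.File.pathSeparator}}
would be the platform-dependent alternative.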