[
https://issues.apache.org/jira/browse/BEAM-9315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17037024#comment-17037024
]
Claudio Venturini commented on BEAM-9315:
-----------------------------------------
Hi!
I admit that until I ran into this issue I also strongly believed that
HADOOP_CONF_DIR should always be a single path. Unfortunately, this is what
Cloudera does in spark-env.sh:
{{HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-$SPARK_CONF_DIR/yarn-conf}}}
{{HIVE_CONF_DIR=${HIVE_CONF_DIR:-/etc/hive/conf}}}
{{if [ -d "$HIVE_CONF_DIR" ]; then}}
{{ HADOOP_CONF_DIR="$HADOOP_CONF_DIR:$HIVE_CONF_DIR"}}
{{fi}}
{{export HADOOP_CONF_DIR}}
As you can see, {{HIVE_CONF_DIR}} gets appended to {{HADOOP_CONF_DIR}},
separated by a colon.
I know it sounds strange. I searched the web, and the Cloudera forums in
particular, for anyone reporting this as an issue, but found nothing.
So I think Beam should be able to cope with this behaviour.
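To make the idea concrete, here is a minimal sketch of the splitting step. It is not the actual Beam code: the class {{HadoopConfDirs}} and both of its methods are hypothetical names used only for illustration. It assumes the value is a list of directories joined by the platform path separator (":" on Linux), and that only {{core-site.xml}} and {{hdfs-site.xml}} are of interest.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Beam) for a HADOOP_CONF_DIR that may hold several paths. */
public class HadoopConfDirs {

  /** Splits a path-list value such as "/a/yarn-conf:/etc/hive/conf" into its non-empty entries. */
  public static List<String> splitConfDirs(String value) {
    List<String> dirs = new ArrayList<>();
    if (value == null) {
      return dirs;
    }
    // File.pathSeparator is ":" on Unix-like systems; quoted because split() takes a regex.
    for (String dir : value.split(Pattern.quote(File.pathSeparator))) {
      String trimmed = dir.trim();
      if (!trimmed.isEmpty()) {
        dirs.add(trimmed);
      }
    }
    return dirs;
  }

  /** Returns the core-site.xml / hdfs-site.xml files that exist in the given directories. */
  public static List<File> findSiteFiles(List<String> dirs) {
    List<File> files = new ArrayList<>();
    for (String dir : dirs) {
      for (String name : new String[] {"core-site.xml", "hdfs-site.xml"}) {
        File candidate = new File(dir, name);
        if (candidate.isFile()) {
          files.add(candidate);
        }
      }
    }
    return files;
  }

  public static void main(String[] args) {
    List<String> dirs = splitConfDirs(System.getenv("HADOOP_CONF_DIR"));
    System.out.println("Config directories: " + dirs);
    System.out.println("Site files found: " + findSiteFiles(dirs));
  }
}
```

Each file found this way could then be handed to a Hadoop {{Configuration}} (for example through {{addResource}}), which is left out here so that the sketch runs without Hadoop on the classpath.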
> HadoopFileSystemOptions unable to interpret HADOOP_CONF_DIR with multiple
> paths
> -------------------------------------------------------------------------------
>
> Key: BEAM-9315
> URL: https://issues.apache.org/jira/browse/BEAM-9315
> Project: Beam
> Issue Type: Bug
> Components: io-java-hadoop-file-system
> Affects Versions: 2.19.0
> Environment: Cloudera CDH 6.3.2 with Spark 2.4.0 (Scala 2.11)
> Reporter: Claudio Venturini
> Assignee: Claudio Venturini
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> In certain Hadoop deployments the {{HADOOP_CONF_DIR}} environment variable
> can contain multiple paths. For example, when running {{spark-submit}},
> Cloudera 6.3 sets it as follows:
> {{HADOOP_CONF_DIR=/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/spark/conf/yarn-conf:/etc/hive/conf}}
> Currently the class {{HadoopFileSystemOptions}} reads the content of the
> variable but treats it as a single path. When it contains multiple paths,
> this makes Beam unable to properly configure Hadoop, and so HDFS can't be
> accessed. At the moment, the only workarounds I'm aware of are:
> - Override the {{HADOOP_CONF_DIR}} set by Cloudera for the Spark service.
> I think this could cause problems with other tools (for example when using
> Hive from Spark, because Spark might no longer be able to find the Hive
> config).
> - Pass the HDFS configuration using the {{--hdfsConfiguration}} option. This
> is inconvenient when there are many settings, and they would not be updated
> automatically when the cluster is reconfigured in Cloudera Manager.
> In my opinion, to fix this the {{HadoopFileSystemOptions}} class should split
> the content of the {{HADOOP_CONF_DIR}} environment variable on the colon
> (":") so that every path it contains is detected.
> I have already implemented this fix, and all tests on class
> {{HadoopFileSystemOptions}} pass. I'm preparing a pull request.
>