Github user vanzin commented on the issue:
https://github.com/apache/spark/pull/22289
`spark.hadoop.*` is not a good name for this option: that's a special prefix in Spark, and any key under it is copied into every Hadoop `Configuration` object that Spark instantiates. That's the easy issue.
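
For context, the prefix roughly works like this (a minimal sketch of the behavior, not the exact internals; the real logic lives in `SparkHadoopUtil`):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf

// Every spark.hadoop.* entry is copied, prefix stripped, into the
// Hadoop Configuration -- so naming a Spark option "spark.hadoop.conf.dir"
// means a stray "conf.dir" key lands in every Hadoop config.
def copySparkHadoopConfigs(conf: SparkConf, hadoopConf: Configuration): Unit = {
  for ((key, value) <- conf.getAll if key.startsWith("spark.hadoop.")) {
    hadoopConf.set(key.stripPrefix("spark.hadoop."), value)
  }
}
```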
The hard one is that your change doesn't seem to achieve what your PR
description says. What you're doing is just uploading the contents of
`spark.hadoop.conf.dir` instead of `HADOOP_CONF_DIR` with your YARN app. That
means a bunch of things:
- the `Client` class is still using whatever Hadoop configuration is in the
classpath to choose the YARN service that will actually run the app.
- the uploaded config is actually added at the end of the classpath of the
AM / executors; the RM places its own configuration before that in the
classpath, so in the launched processes, you're still *not* going to be using
the configuration you defined in `spark.hadoop.conf.dir`.
- the configuration used by the `Client` class that I mention above is
actually written to a separate file and also sent over to the AM / executors,
and overlaid on top of the configuration (see
`SparkHadoopUtil.newConfiguration`, and the sketch after this list).
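
Here's a toy illustration of the precedence involved (key and values are placeholders):

```scala
import org.apache.hadoop.conf.Configuration

// Two ordering rules work against the uploaded directory:
// - classpath lookup: the *first* core-site.xml found on the classpath
//   wins, and the RM puts its own config ahead of the uploaded dir;
// - overlay: values applied later override earlier ones, which is how
//   the file written by Client ends up taking precedence.
val conf = new Configuration(false)           // start empty, skip default resources
conf.set("fs.defaultFS", "hdfs://cluster-a")  // what the cluster-side XML provided
conf.set("fs.defaultFS", "hdfs://cluster-b")  // the Client's overlay, applied last
assert(conf.get("fs.defaultFS") == "hdfs://cluster-b")  // later value wins
```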
So to actually achieve what you want to do, you'd have to fix at least two
things:
- `SparkHadoopUtil.newConfiguration`
- the way `Client` creates the YARN configuration (which is
[here](https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L68))
Otherwise, this change isn't actually doing much that I can see.
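
For illustration only, the `Client` side of that fix might look something along these lines (a hypothetical sketch; the method name and the handling below are assumptions, not part of this PR, and the option itself would still need renaming per the comment above):

```scala
import java.io.File
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.yarn.conf.YarnConfiguration
import org.apache.spark.SparkConf

// Hypothetical: build the YARN config used by Client from the directory
// named by the new option, instead of whatever is on the classpath.
def newYarnConfiguration(sparkConf: SparkConf): YarnConfiguration = {
  val hadoopConf = new Configuration()
  sparkConf.getOption("spark.hadoop.conf.dir").foreach { dir =>
    val files = Option(new File(dir).listFiles()).getOrElse(Array.empty[File])
    files.filter(_.getName.endsWith(".xml")).sortBy(_.getName)
      .foreach(f => hadoopConf.addResource(new Path(f.toURI)))
  }
  new YarnConfiguration(hadoopConf)
}
```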