Github user vanzin commented on the issue:
https://github.com/apache/spark/pull/22289
`spark.hadoop.*` is not a good name for this option: that's a special prefix in Spark, and any key under it is copied into every Hadoop `Configuration` object that Spark instantiates. That's the easy issue.
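
For context, the prefix roughly works like this (a minimal sketch of the behavior, not the exact internals; the real logic lives in `SparkHadoopUtil`):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf

// Every spark.hadoop.* entry is copied, prefix stripped, into the
// Hadoop Configuration -- so naming a Spark option "spark.hadoop.conf.dir"
// means a stray "conf.dir" key lands in every Hadoop config.
def copySparkHadoopConfigs(conf: SparkConf, hadoopConf: Configuration): Unit = {
  for ((key, value) <- conf.getAll if key.startsWith("spark.hadoop.")) {
    hadoopConf.set(key.stripPrefix("spark.hadoop."), value)
  }
}
```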
The hard one is that your change doesn't seem to achieve what your PR
description says. What you're doing is just uploading the contents of
`spark.hadoop.conf.dir` instead of `HADOOP_CONF_DIR` with your YARN app. That
means a bunch of things:
- the `Client` class is still using whatever Hadoop configuration is in the
classpath to choose the YARN service that will actually run the app.
- the uploaded config is actually added at the end of the classpath of the
AM / executors; the RM places its own configuration before that in the
classpath, so in the launched processes, you're still *not* going to be using
the configuration you defined in `spark.hadoop.conf.dir`.
- the configuration used by the `Client` class that I mention above is
actually written to a separate file and also sent over to the AM / executors,
and overlaid on top of the configuration (see
`SparkHadoopUtil.newConfiguration`, and the sketch after this list).
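
Here's a toy illustration of the precedence involved (key and values are placeholders):

```scala
import org.apache.hadoop.conf.Configuration

// Two ordering rules work against the uploaded directory:
// - classpath lookup: the *first* core-site.xml found on the classpath
//   wins, and the RM puts its own config ahead of the uploaded dir;
// - overlay: values applied later override earlier ones, which is how
//   the file written by Client ends up taking precedence.
val conf = new Configuration(false)           // start empty, skip default resources
conf.set("fs.defaultFS", "hdfs://cluster-a")  // what the cluster-side XML provided
conf.set("fs.defaultFS", "hdfs://cluster-b")  // the Client's overlay, applied last
assert(conf.get("fs.defaultFS") == "hdfs://cluster-b")  // later value wins
```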
So to actually achieve what you want to do, you'd have to fix at least two
things:
- `SparkHadoopUtil.newConfiguration`
- the way `Client` creates the YARN configuration (which is
[here](https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L68))
Otherwise, this change isn't actually doing much that I can see.
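
For illustration only, the `Client` side of that fix might look something along these lines (a hypothetical sketch; the method name and the handling below are assumptions, not part of this PR, and the option itself would still need renaming per the comment above):

```scala
import java.io.File
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.yarn.conf.YarnConfiguration
import org.apache.spark.SparkConf

// Hypothetical: build the YARN config used by Client from the directory
// named by the new option, instead of whatever is on the classpath.
def newYarnConfiguration(sparkConf: SparkConf): YarnConfiguration = {
  val hadoopConf = new Configuration()
  sparkConf.getOption("spark.hadoop.conf.dir").foreach { dir =>
    val files = Option(new File(dir).listFiles()).getOrElse(Array.empty[File])
    files.filter(_.getName.endsWith(".xml")).sortBy(_.getName)
      .foreach(f => hadoopConf.addResource(new Path(f.toURI)))
  }
  new YarnConfiguration(hadoopConf)
}
```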