[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

steveloughran Thu, 03 Aug 2017 02:32:05 -0700

Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18668#discussion_r131094720
  
    --- Diff: docs/configuration.md ---
    @@ -2335,5 +2335,61 @@ The location of these configuration files varies 
across Hadoop versions, but
     a common location is inside of `/etc/hadoop/conf`. Some tools create
     configurations on-the-fly, but offer a mechanisms to download copies of 
them.
     
    -To make these files visible to Spark, set `HADOOP_CONF_DIR` in 
`$SPARK_HOME/spark-env.sh`
    +To make these files visible to Spark, set `HADOOP_CONF_DIR` in 
`$SPARK_HOME/conf/spark-env.sh`
     to a location containing the configuration files.
    +
    +# Custom Hadoop/Hive Configuration
    +
    +If your Spark applications interacting with Hadoop, Hive, or both, there 
are probably Hadoop/Hive
    +configuration files in Spark's class path.
    +
    +Multiple running applications might require different Hadoop/Hive client 
side configurations.
    +You can copy and modify `hdfs-site.xml`, `core-site.xml`, `yarn-site.xml`, 
`hive-site.xml` in
    +Spark's class path for each application, but it is not very convenient and 
these
    +files are best to be shared with common properties to avoid hard-coding 
certain configurations.
    +
    +The better choice is to use spark hadoop properties in the form of 
`spark.hadoop.*`. 
    +They can be considered as same as normal spark properties which can be set 
in `$SPARK_HOME/conf/spark-defalut.conf`
    +
    +In some cases, you may want to avoid hard-coding certain configurations in 
a `SparkConf`. For
    +instance, Spark allows you to simply create an empty conf and set 
spark/spark hadoop properties.
    +
    +{% highlight scala %}
    +val conf = new SparkConf().set("spark.hadoop.abc.def","xyz")
    +val sc = new SparkContext(conf)
    +{% endhighlight %}
    +
    +Also, you can modify or add configurations at runtime:
    +{% highlight bash %}
    +./bin/spark-submit \ 
    +  --name "My app" \ 
    +  --master local[4] \  
    +  --conf spark.eventLog.enabled=false \ 
    +  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails 
-XX:+PrintGCTimeStamps" \ 
    +  --conf spark.hadoop.abc.def=xyz \ 
    +  myApp.jar
    +{% endhighlight %}
    +
    +## Typical Hadoop/Hive Configurations
    +
    +<table>
    +<tr>
    +  <td><code>spark.hadoop.<br 
/>mapreduce.fileoutputcommitter.algorithm.version</code></td>
    +  <td>1</td>
    +  <td>
    +    The file output committer algorithm version, valid algorithm version 
number: 1 or 2.
    +    Version 2 may have better performance, but version 1 may handle 
failures better in certain situations,
    +    as per <a href="https://issues.apache.org/jira/browse/MAPREDUCE-4815">MAPREDUCE-4815</a>.
    +  </td>
    +</tr>
    +
    +<tr>
    +  <td><code>spark.hadoop.<br />fs.hdfs.impl.disable.cache</code></td>
    --- End diff --
    
    This is a pretty dangerous one to point people at, especially since it's 
fixed in future Hadoop versions and backported to some distros. Creating a new 
HDFS client on every worker can get very expensive if you have a Spark process 
with many threads, all fielding work from the same user (thread pools, IPC 
connections, ...).


