Kent Yao created SPARK-34346:
--------------------------------

             Summary: io.file.buffer.size set by spark.buffer.size can be 
overridden by hive-site.xml, which may cause a perf regression
                 Key: SPARK-34346
                 URL: https://issues.apache.org/jira/browse/SPARK-34346
             Project: Spark
          Issue Type: Bug
          Components: Spark Core, SQL
    Affects Versions: 3.1.1
            Reporter: Kent Yao


In many real-world cases, when interacting with the Hive catalog through Spark 
SQL, users simply reuse the `hive-site.xml` from their Hive jobs, copying it 
into `SPARK_HOME`/conf without modification. When Spark generates the Hadoop 
configuration, it uses `spark.buffer.size` (default 65536) to override 
`io.file.buffer.size` (default 4096). But when we later load hive-site.xml, we 
ignore this override and reset `io.file.buffer.size` again according to 
`hive-site.xml`.

1. The configuration priority applied when building the Hadoop and Hive config 
here is wrong; the order should be `spark > spark.hive > spark.hadoop > hive > 
hadoop`.

2. This breaks the `spark.buffer.size` config's behavior for tuning IO 
performance with HDFS whenever `hive-site.xml` contains its own 
`io.file.buffer.size` entry.
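The intended precedence can be sketched as a layered merge where 
higher-priority layers are applied last. This is a minimal illustration in 
plain Java, with maps standing in for Hadoop's `Configuration` object; the 
class and values below are hypothetical, not Spark's actual code:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ConfPrecedence {
    // Merge layers from lowest to highest priority; a later put wins.
    static Map<String, String> resolve(List<Map<String, String>> layers) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> layer : layers) {
            merged.putAll(layer);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> hadoopDefaults = Map.of("io.file.buffer.size", "4096");
        Map<String, String> hiveSite = Map.of("io.file.buffer.size", "131072");
        Map<String, String> sparkSide = Map.of("io.file.buffer.size", "65536"); // from spark.buffer.size

        // Intended order: hadoop < hive < spark-side settings,
        // so the Spark-side value must survive loading hive-site.xml.
        Map<String, String> resolved =
            resolve(List.of(hadoopDefaults, hiveSite, sparkSide));
        System.out.println(resolved.get("io.file.buffer.size")); // 65536
    }
}
```

The bug described above corresponds to applying the `hiveSite` layer after 
`sparkSide`, which would make the hive-site.xml value win instead.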




--
This message was sent by Atlassian Jira
(v8.3.4#803005)
