GitHub user viirya opened a pull request:
https://github.com/apache/spark/pull/15667
[SPARK-18107][SQL] Insert overwrite statement runs much slower in spark-sql
than it does in hive-client
## What changes were proposed in this pull request?
As reported on the JIRA, an insert overwrite statement runs much slower in
Spark than in hive-client.
There is a patch,
[HIVE-11940](https://github.com/apache/hive/commit/ba21806b77287e237e1aa68fa169d2a81e07346d),
which largely improves insert overwrite performance in Hive. HIVE-11940 was
applied after Hive 2.0.0.
Because Spark SQL uses an older Hive library, it cannot benefit from this
improvement.
The reporter verified that there is also a big performance gap between Hive
1.2.1 and Hive 2.0.1 on insert overwrite execution.
Instead of upgrading Spark SQL to Hive 2.0, which might not be a trivial
task, this patch deletes the partition before asking Hive to load the data
files into it.
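
The idea can be sketched roughly as below. This is only an illustration, not the code in this patch: it uses the local filesystem as a stand-in for HDFS and for the Hive load call, and `overwrite_partition` and `demo` are hypothetical names made up for the example.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def overwrite_partition(staging_dir: Path, partition_dir: Path) -> List[str]:
    """Replace the contents of partition_dir with the files in staging_dir."""
    # Step 1: drop the existing partition directory in one recursive delete,
    # so the load step finds no old files that it would have to replace.
    if partition_dir.exists():
        shutil.rmtree(partition_dir)
    partition_dir.mkdir(parents=True)

    # Step 2: move the newly written files into the now-empty partition.
    # (Assumption: this stands in for asking Hive to load the data files.)
    for src in sorted(staging_dir.iterdir()):
        shutil.move(str(src), str(partition_dir / src.name))

    return sorted(p.name for p in partition_dir.iterdir())


def demo() -> List[str]:
    """Build a throwaway table layout and overwrite one partition."""
    root = Path(tempfile.mkdtemp())
    staging = root / "staging"
    partition = root / "table" / "ds=2016-10-27"
    staging.mkdir()
    partition.mkdir(parents=True)
    (partition / "old-0").write_text("old")
    (staging / "part-00000").write_text("new")
    try:
        return overwrite_partition(staging, partition)
    finally:
        shutil.rmtree(root)
```

Calling `demo()` leaves only the newly staged file in the partition; the old file is gone before the load step runs.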
Note: since `Hive.loadTable` also uses the function to replace files, it
should have the same issue. We can take the same approach and delete the
table first. I will update this patch to include that.
## How was this patch tested?
Jenkins tests.
There are existing tests that use the insert overwrite statement. Those
tests should pass. I added a new test specifically for insert overwrite into
a partition.
As for the performance issue, I don't have a Hive 2.0 environment, so the
reporter needs to verify this patch. Please refer to the JIRA.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/viirya/spark-1 improve-hive-insertoverwrite
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15667.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15667
----
commit 81dbeb19e61a67a287a5762e391517eb55a20721
Author: Liang-Chi Hsieh <[email protected]>
Date: 2016-10-27T09:29:16Z
Drop partition before insert overwrite to Hive table.
----