GitHub user marmbrus opened a pull request:
https://github.com/apache/spark/pull/5876
[SPARK-6908] [SQL] Use isolated Hive client
This PR switches Spark SQL's Hive support to use the isolated hive client
interface introduced by #5851, instead of directly interacting with the client.
By using this isolated client we can now allow users to dynamically configure
the version of Hive that they are connecting to by setting
`spark.sql.hive.version` without the need to recompile. This also greatly
reduces the surface area of our interaction with the Hive libraries, hopefully
making it easier to support other versions in the future.
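As an aside, the classloader-isolation technique an isolated client relies on can be sketched in a few lines. The class below is a simplified illustration of the general pattern (version-specific jars with a shared fallback for a small API surface), not Spark's actual implementation; all names here are hypothetical.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.function.Predicate;

// Simplified sketch of classloader isolation: classes for a specific
// library version load from their own jars, while a small shared API
// surface is delegated to the application classloader. Hypothetical
// names; not Spark's actual classes.
public class IsolatedLoader extends URLClassLoader {
    private final ClassLoader shared;
    private final Predicate<String> isShared;

    public IsolatedLoader(URL[] versionJars, ClassLoader shared, Predicate<String> isShared) {
        super(versionJars, null); // null parent: no automatic delegation
        this.shared = shared;
        this.isShared = isShared;
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        if (isShared.test(name)) {
            return shared.loadClass(name); // shared interfaces cross the boundary
        }
        return super.loadClass(name, resolve); // everything else stays isolated
    }

    // Convenience wrapper for demonstration; returns null when the class is absent.
    public Class<?> tryLoad(String name) {
        try {
            return loadClass(name);
        } catch (ClassNotFoundException e) {
            return null;
        }
    }
}
```

With this shape, each Hive version gets its own loader instance, and only the narrow client interface is shared with the rest of the application.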
Remaining TODOs for this PR or a follow-up PR:
- Currently the classes for a specific version of Hive are retrieved on
demand from maven. This is convenient, but won't work in deployments that are
not connected to the internet. We should use the Spark jars when connecting to
the "native" version of Hive and provide an option for users who want to
manually provide their own version of Hive.
- The CLI and Thrift server tests are not yet passing because configuration
from `-hiveconf` options is not propagated correctly.
- Multiple sessions for Hive are broken.
- Several HiveCompatibility tests are not yet passing.
- `nullformatCTAS` - As detailed below, we are now handling CTAS parsing
ourselves instead of hacking into the Hive semantic analyzer. However, we
currently handle only the common cases, not variants such as CTAS with an
explicit null format.
- `combine1` now leaks state about compression somehow, breaking all
subsequent tests. As such, we currently add it to the blacklist.
- `part_inherit_tbl_props` and `part_inherit_tbl_props_with_star` no longer
pass. We are propagating the information correctly, yet the tests still fail.
- `load_dyn_part14.*` - These tests pass when run on their own, but fail
when run with all other tests. It seems our `RESET` mechanism may not be as
robust as it used to be.
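The `RESET` issue above comes down to per-session overrides leaking across tests. A minimal sketch of the intended behavior, with hypothetical names (this is not Spark's actual configuration class):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a RESET-style mechanism: per-session overrides
// are kept separate from defaults, so RESET can drop them wholesale and
// settings (e.g. compression flags set by one test) cannot leak into
// later tests. Illustrative names only.
public class SessionConf {
    private final Map<String, String> defaults = new HashMap<>();
    private final Map<String, String> overrides = new HashMap<>();

    public SessionConf() {
        // Example default; an assumed key for illustration.
        defaults.put("hive.exec.compress.output", "false");
    }

    public void set(String key, String value) {
        overrides.put(key, value);
    }

    public String get(String key) {
        return overrides.getOrDefault(key, defaults.get(key));
    }

    // RESET: discard every session override, restoring the defaults.
    public void reset() {
        overrides.clear();
    }
}
```

If overrides are instead written directly into the shared default map, a `reset()` has nothing reliable to restore, which matches the leakage symptoms described above.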
Other required changes:
- `CreateTableAsSelect` no longer carries parts of the HiveQL AST with it
through the query execution pipeline. Instead, we parse CTAS during the HiveQL
conversion and construct a `HiveTable`. The parsing here is not yet complete,
as detailed above in the remaining TODOs. Since the operator is Hive-specific,
it has been moved to the hive package.
- `Command` is simplified to be a trait that simply acts as a marker for a
`LogicalPlan` that should be eagerly evaluated.
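The marker pattern described above can be sketched as follows; the class and plan names are hypothetical stand-ins, not Spark's actual types:

```java
// Illustrative sketch of a marker type for eager evaluation: a plan node
// implementing `Command` signals that it should be executed immediately
// rather than lazily. Hypothetical names; not Spark's actual classes.
public class MarkerDemo {
    public interface LogicalPlan {}
    public interface Command extends LogicalPlan {} // pure marker, no methods

    // A side-effecting operation that must run eagerly.
    public static final class SetProperty implements Command {
        public final String key, value;
        public SetProperty(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    // An ordinary relational operator, evaluated lazily.
    public static final class Projection implements LogicalPlan {}

    // The execution pipeline only needs an instanceof check.
    public static boolean runsEagerly(LogicalPlan plan) {
        return plan instanceof Command;
    }
}
```

The appeal of the marker approach is that adding a new eagerly-evaluated operation requires no changes to the pipeline itself, only implementing the marker.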
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/marmbrus/spark useIsolatedClient
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/5876.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #5876
----
commit 8843a255cde337a67e9a3b12440bac511ea69939
Author: Michael Armbrust <[email protected]>
Date: 2015-05-03T20:15:30Z
[SPARK-6908] [SQL] Use isolated Hive client
Conflicts:
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala
----