GitHub user marmbrus opened a pull request:
https://github.com/apache/spark/pull/5876
[SPARK-6908] [SQL] Use isolated Hive client
This PR switches Spark SQL's Hive support to use the isolated hive client
interface introduced by #5851, instead of directly interacting with the client.
By using this isolated client we can now allow users to dynamically configure
the version of Hive that they are connecting to by setting
`spark.sql.hive.version` without the need to recompile. This also greatly
reduces the surface area of our interaction with the Hive libraries, hopefully
making it easier to support other versions in the future.
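As an aside, the classloader-isolation technique an isolated client relies on can be sketched in a few lines. The class below is a simplified illustration of the general pattern (version-specific jars with a shared fallback for a small API surface), not Spark's actual implementation; all names here are hypothetical.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.function.Predicate;

// Simplified sketch of classloader isolation: classes for a specific
// library version load from their own jars, while a small shared API
// surface is delegated to the application classloader. Hypothetical
// names; not Spark's actual classes.
public class IsolatedLoader extends URLClassLoader {
    private final ClassLoader shared;
    private final Predicate<String> isShared;

    public IsolatedLoader(URL[] versionJars, ClassLoader shared, Predicate<String> isShared) {
        super(versionJars, null); // null parent: no automatic delegation
        this.shared = shared;
        this.isShared = isShared;
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        if (isShared.test(name)) {
            return shared.loadClass(name); // shared interfaces cross the boundary
        }
        return super.loadClass(name, resolve); // everything else stays isolated
    }

    // Convenience wrapper for demonstration; returns null when the class is absent.
    public Class<?> tryLoad(String name) {
        try {
            return loadClass(name);
        } catch (ClassNotFoundException e) {
            return null;
        }
    }
}
```

With this shape, each Hive version gets its own loader instance, and only the narrow client interface is shared with the rest of the application.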
Remaining TODOs for this PR or a follow-up PR:
- Currently the classes for a specific version of Hive are retrieved on
demand from maven. This is convenient, but won't work in deployments that are
not connected to the internet. We should use the Spark jars when connecting to
the "native" version of Hive and provide an option for users who want to
manually provide their own version of Hive.
- The CLI and Thrift server tests are not yet passing because configuration
from `-hiveconf` options is not propagated correctly.
- Multiple sessions for Hive are broken.
- Several HiveCompatibility tests are not yet passing.
- `nullformatCTAS` - As detailed below, we are now handling CTAS parsing
ourselves instead of hacking into the Hive semantic analyzer. However, we
currently handle only the common cases, not variants such as CTAS with an
explicit null format.
- `combine1` now leaks state about compression somehow, breaking all
subsequent tests. As such, we currently add it to the blacklist.
- `part_inherit_tbl_props` and `part_inherit_tbl_props_with_star` no longer
pass. We are propagating the information correctly, yet the tests still fail.
- `load_dyn_part14.*` - These tests pass when run on their own, but fail
when run with all other tests. It seems our `RESET` mechanism may not be as
robust as it used to be.
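The `RESET` issue above comes down to per-session overrides leaking across tests. A minimal sketch of the intended behavior, with hypothetical names (this is not Spark's actual configuration class):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a RESET-style mechanism: per-session overrides
// are kept separate from defaults, so RESET can drop them wholesale and
// settings (e.g. compression flags set by one test) cannot leak into
// later tests. Illustrative names only.
public class SessionConf {
    private final Map<String, String> defaults = new HashMap<>();
    private final Map<String, String> overrides = new HashMap<>();

    public SessionConf() {
        // Example default; an assumed key for illustration.
        defaults.put("hive.exec.compress.output", "false");
    }

    public void set(String key, String value) {
        overrides.put(key, value);
    }

    public String get(String key) {
        return overrides.getOrDefault(key, defaults.get(key));
    }

    // RESET: discard every session override, restoring the defaults.
    public void reset() {
        overrides.clear();
    }
}
```

If overrides are instead written directly into the shared default map, a `reset()` has nothing reliable to restore, which matches the leakage symptoms described above.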
Other required changes:
- `CreateTableAsSelect` no longer carries parts of the HiveQL AST with it
through the query execution pipeline. Instead, we parse CTAS during the HiveQL
conversion and construct a `HiveTable`. The parsing here is not yet complete,
as detailed above in the remaining TODOs. Since the operator is Hive-specific,
it has been moved to the hive package.
- `Command` is simplified to be a trait that simply acts as a marker for a
`LogicalPlan` that should be eagerly evaluated.
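The marker pattern described above can be sketched as follows; the class and plan names are hypothetical stand-ins, not Spark's actual types:

```java
// Illustrative sketch of a marker type for eager evaluation: a plan node
// implementing `Command` signals that it should be executed immediately
// rather than lazily. Hypothetical names; not Spark's actual classes.
public class MarkerDemo {
    public interface LogicalPlan {}
    public interface Command extends LogicalPlan {} // pure marker, no methods

    // A side-effecting operation that must run eagerly.
    public static final class SetProperty implements Command {
        public final String key, value;
        public SetProperty(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    // An ordinary relational operator, evaluated lazily.
    public static final class Projection implements LogicalPlan {}

    // The execution pipeline only needs an instanceof check.
    public static boolean runsEagerly(LogicalPlan plan) {
        return plan instanceof Command;
    }
}
```

The appeal of the marker approach is that adding a new eagerly-evaluated operation requires no changes to the pipeline itself, only implementing the marker.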
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/marmbrus/spark useIsolatedClient
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/5876.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #5876
----
commit 8843a255cde337a67e9a3b12440bac511ea69939
Author: Michael Armbrust <[email protected]>
Date: 2015-05-03T20:15:30Z
[SPARK-6908] [SQL] Use isolated Hive client
Conflicts:
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala
----