Thanks Cheng for the insights. 

Regarding HCatalog, I did some initial investigation too and agree with 
you: as of now, it does not seem to be a good solution. I will try to talk to 
the Hive folks to see whether downward compatibility is guaranteed for the 
Thrift protocol. By the way, I tried some basic operations using a Hive 0.13 
client connected to a Hive 0.14 metastore, and they appear to be compatible. 
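For what it's worth, the usual reason Thrift-based protocols stay compatible is that every struct field carries a numeric ID, and readers simply skip IDs they don't recognize. A toy Python sketch of that skipping behavior (purely illustrative; this is not the actual Hive metastore protocol):

```python
# Toy illustration of Thrift-style struct evolution (NOT the real Hive
# metastore protocol): every field is tagged with a numeric ID, and a
# reader silently skips IDs it does not know about. This is the property
# that lets an old client talk to a newer metastore.

def serialize(fields):
    # "Wire format": a list of (field_id, value) pairs, as a newer
    # server might serialize a struct.
    return list(fields.items())

def deserialize(wire, known_ids):
    # An old reader only understands some field IDs; unknown IDs are
    # skipped instead of causing a failure.
    return {fid: val for fid, val in wire if fid in known_ids}

# Newer struct: field 3 (say, a skewed-columns list) was added in a
# later version of the protocol.
new_struct = {1: "page_views", 2: "default", 3: ["col_a"]}

wire = serialize(new_struct)
old_view = deserialize(wire, known_ids={1, 2})
print(old_view)  # the old reader still sees fields 1 and 2
```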

Thanks.

Zhan Zhang


On Nov 22, 2014, at 7:14 AM, Cheng Lian <lian.cs....@gmail.com> wrote:

> I should emphasize that this is still a quick and rough conclusion; I will 
> investigate in more detail after the 1.2.0 release. In any case, we really 
> want to make Hive support in Spark SQL as smooth and clean as possible for 
> both developers and end users.
> 
> On 11/22/14 11:05 PM, Cheng Lian wrote:
>> 
>> Hey Zhan,
>> 
>> This is a great question. We are also looking for a stable API/protocol that 
>> works across multiple Hive versions (esp. 0.12+). SPARK-4114 
>> <https://issues.apache.org/jira/browse/SPARK-4114> was opened for this. I did 
>> some research into HCatalog recently, but I must confess that I’m not an 
>> expert on HCatalog; I actually spent only one day exploring it. So please 
>> don’t hesitate to correct me if any of the conclusions I made below are 
>> wrong.
>> 
>> First, although the HCatalog API is more pleasant to work with, it’s 
>> unfortunately feature-incomplete: it provides only a subset of the most 
>> commonly used operations. For example, |HCatCreateTableDesc| maps only a 
>> subset of |CreateTableDesc|; properties like |storeAsSubDirectories|, 
>> |skewedColNames| and |skewedColValues| are missing. It’s also impossible to 
>> alter table properties via the HCatalog API (Spark SQL uses this to 
>> implement the |ANALYZE| command). The |hcat| CLI tool does provide the 
>> features missing from the HCatalog API by going through the raw Metastore 
>> API, and is structurally similar to the old Hive CLI.
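>> To make the gap concrete, altering table properties through the raw 
>> Metastore API looks roughly like the sketch below (not Spark SQL’s actual 
>> code; the table and property names are made up, and it needs the Hive client 
>> jars plus a reachable metastore):
>> 
>> ```java
>> // Sketch: altering table properties via the raw Metastore API, which
>> // the HCatalog API does not expose. Table/property names are made up.
>> import org.apache.hadoop.hive.conf.HiveConf;
>> import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
>> import org.apache.hadoop.hive.metastore.api.Table;
>> 
>> public class AlterTableProps {
>>   public static void main(String[] args) throws Exception {
>>     HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
>>     Table table = client.getTable("default", "page_views");
>>     // Statistics like this are what ANALYZE needs to record.
>>     table.getParameters().put("numRows", "500");
>>     client.alter_table("default", "page_views", table);
>>     client.close();
>>   }
>> }
>> ```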
>> 
>> Second, the HCatalog API itself doesn’t ensure compatibility; it’s the 
>> Thrift protocol that matters. HCatalog is built directly upon the raw 
>> Metastore API and speaks the same Metastore Thrift protocol. The problem 
>> we’ve encountered in Spark SQL is that we usually deploy Hive support with 
>> an embedded (for testing) or local-mode Metastore, which makes us suffer 
>> from things like Metastore database schema changes. If the Hive Metastore 
>> Thrift protocol is guaranteed to be downward compatible, then hopefully we 
>> can resort to a remote-mode Metastore and always depend on the most recent 
>> Hive APIs. I had a glance at the Thrift protocol version handling code in 
>> Hive, and it seems that downward compatibility is not an issue. However, I 
>> didn’t find any official documentation about Thrift protocol compatibility.
>> 
>> That said, hopefully in the future we can depend only on the most recent 
>> Hive version and remove the Hive shim layer introduced in branch 1.2. Users 
>> who run exactly the same version of Hive as Spark SQL can use either a 
>> remote or a local/embedded Metastore, while users who want to interact with 
>> existing legacy Hive clusters have to set up a remote Metastore and let the 
>> Thrift protocol handle compatibility.
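>> Concretely, remote mode here just means pointing clients at a standalone 
>> metastore service in hive-site.xml; the host and port below are 
>> placeholders:
>> 
>> ```xml
>> <!-- hive-site.xml: connect to a remote (standalone) metastore -->
>> <property>
>>   <name>hive.metastore.uris</name>
>>   <value>thrift://metastore-host:9083</value>
>> </property>
>> ```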
>> 
>> — Cheng
>> 
>> On 11/22/14 6:51 AM, Zhan Zhang wrote:
>> 
>>> The Spark and Hive integration is a very nice feature, but I am wondering
>>> what the long-term roadmap for it is. Both projects are undergoing fast
>>> improvement and change. Currently, my understanding is that the Spark SQL
>>> Hive support relies on the Hive metastore and basic parser to operate, and
>>> the Thrift server intercepts Hive queries and executes them with Spark's
>>> own engine.
>>> 
>>> With every release of Hive, a significant effort is needed on the Spark
>>> side to support it.
>>> 
>>> For the metastore part, we could possibly replace it with HCatalog. But
>>> given the dependency of other parts on Hive, e.g., the metastore and
>>> Thrift server, HCatalog may not be able to help much.
>>> 
>>> Does anyone have any insight or idea in mind?
>>> 
>>> Thanks.
>>> 
>>> Zhan Zhang
>>> 
>>> 
>>> 
> 



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
