I should emphasize that this is still a quick and rough conclusion; I
will investigate it in more detail after the 1.2.0 release. In any
case, we would really like to make Hive support in Spark SQL as smooth
and clean as possible for both developers and end users.
On 11/22/14 11:05 PM, Cheng Lian wrote:
Hey Zhan,
This is a great question. We are also seeking a stable API/protocol
that works with multiple Hive versions (esp. 0.12+).
SPARK-4114 <https://issues.apache.org/jira/browse/SPARK-4114> was
opened for this. I did some research into HCatalog recently, but I
must confess that I'm not an expert on HCatalog; I actually spent only
a day exploring it. So please don't hesitate to correct me if any of
the conclusions I make below are wrong.
First, although the HCatalog API is more pleasant to work with, it is
unfortunately feature-incomplete: it provides only a subset of the most
commonly used operations. For example, |HCatCreateTableDesc| maps only
a subset of |CreateTableDesc|; properties like
|storeAsSubDirectories|, |skewedColNames|, and |skewedColValues| are
missing. It is also impossible to alter table properties via the
HCatalog API (Spark SQL uses this to implement the |ANALYZE| command).
The |hcat| CLI tool provides all the features missing from the
HCatalog API via the raw Metastore API, and is structurally similar to
the old Hive CLI.
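To make the |ANALYZE| point concrete, here is a minimal Python sketch of the read-modify-write flow Spark SQL needs on table properties. The |FakeMetastore| class is an in-memory stand-in, not the real Hive client; in Hive this corresponds to |get_table|/|alter_table| on the raw (Thrift) Metastore API, which the HCatalog API does not expose:

```python
# Sketch of the alter-table-properties flow that ANALYZE needs.
# FakeMetastore is an in-memory stand-in for a real Metastore client;
# in Hive this maps to get_table/alter_table on the raw Thrift
# Metastore API, which HCatalog does not expose.

class FakeMetastore:
    def __init__(self):
        self.tables = {}  # (db, table) -> {"parameters": {...}}

    def get_table(self, db, table):
        return self.tables[(db, table)]

    def alter_table(self, db, table, new_table):
        self.tables[(db, table)] = new_table


def set_table_property(client, db, table, key, value):
    """Read-modify-write one table property, as an ANALYZE
    implementation would to record collected statistics."""
    t = client.get_table(db, table)
    t["parameters"][key] = value
    client.alter_table(db, table, t)


client = FakeMetastore()
client.tables[("default", "logs")] = {"parameters": {}}
set_table_property(client, "default", "logs", "totalSize", "1048576")
print(client.get_table("default", "logs")["parameters"]["totalSize"])
# -> 1048576
```

The property key |totalSize| here is just an example of a statistic stored as a string-valued table parameter.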
Second, the HCatalog API itself doesn't ensure compatibility; it's the
Thrift protocol that matters. HCatalog is built directly upon the raw
Metastore API and speaks the same Metastore Thrift protocol. The
problem we encountered in Spark SQL is that we usually deploy Spark
SQL Hive support with an embedded-mode (for testing) or local-mode
Metastore, and this makes us suffer from things like Metastore
database schema changes. If the Hive Metastore Thrift protocol is
guaranteed to be backward compatible, then hopefully we can resort to
a remote-mode Metastore and always depend on the most recent Hive
APIs. I glanced at the Thrift protocol version-handling code in Hive,
and it seems that backward compatibility is not an issue. However, I
didn't find any official documentation about Thrift protocol
compatibility.
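The backward-compatibility property we are hoping for can be sketched as a toy version negotiation: a server that understands protocol version N can also serve clients speaking any older version. This is only an illustration of the general idea, not Hive's actual handshake code, and the version numbers are made up:

```python
# Toy sketch of backward-compatible protocol version negotiation.
# A server at version N serves any client at version <= N; a newer
# client is rejected, since forward compatibility is not guaranteed.

SERVER_PROTOCOL_VERSION = 7  # hypothetical current server version

def negotiate(client_version, server_version=SERVER_PROTOCOL_VERSION):
    """Return the protocol version to use, or raise if the client
    speaks a newer protocol than the server understands."""
    if client_version > server_version:
        raise ValueError(
            f"client speaks v{client_version}, "
            f"server only knows v{server_version}")
    # Backward compatible: fall back to the client's (older) version.
    return client_version

print(negotiate(5))  # older client is fine -> 5
print(negotiate(7))  # same version -> 7
```

If the Metastore Thrift protocol behaves this way, a new Spark SQL client talking to an old Metastore is the one unsafe direction, which is exactly why a remote Metastore running the cluster's own Hive version is the safe deployment.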
That said, in the future we can hopefully depend only on the most
recent Hive dependencies and remove the Hive shim layer introduced in
branch 1.2. Users who run exactly the same version of Hive as Spark
SQL can use either a remote or a local/embedded Metastore, while users
who want to interact with existing legacy Hive clusters have to set up
a remote Metastore and let the Thrift protocol handle compatibility.
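Concretely, pointing at a remote Metastore is a matter of setting |hive.metastore.uris| in hive-site.xml (the hostname below is a placeholder; 9083 is the conventional Metastore port):

```xml
<!-- hive-site.xml: use a remote Metastore instead of a
     local/embedded one. The hostname is a placeholder. -->
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host.example.com:9083</value>
  </property>
</configuration>
```

With this set, the client never touches the Metastore database directly, so Metastore schema changes stay on the server side.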
— Cheng
On 11/22/14 6:51 AM, Zhan Zhang wrote:
The Spark and Hive integration is a very nice feature, but I am wondering
what the long-term roadmap is for Spark's integration with Hive. Both
projects are undergoing fast improvement and changes. Currently, my
understanding is that the Spark SQL Hive support relies on the Hive
metastore and basic parser to operate, and the Thrift server intercepts
Hive queries and replaces the execution with Spark's own engine.
With every release of Hive, a significant effort is needed on the Spark
side to support it.
For the metastore part, we could possibly replace it with HCatalog. But
given the dependency of other parts on Hive, e.g., the metastore and the
Thrift server, HCatalog may not be able to help much.
Does anyone have any insight or idea in mind?
Thanks.
Zhan Zhang
--
View this message in
context: http://apache-spark-developers-list.1001551.n3.nabble.com/How-spark-and-hive-integrate-in-long-term-tp9482.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.