Hey Zhan,
This is a great question. We are also looking for a stable API/protocol
that works across multiple Hive versions (esp. 0.12+). SPARK-4114
<https://issues.apache.org/jira/browse/SPARK-4114> was opened for this.
I did some research into HCatalog recently, but I must confess that I'm
not an expert on it; I actually spent only a day exploring it. So
please don't hesitate to correct me if any of the conclusions below
are wrong.
First, although the HCatalog API is more pleasant to work with, it's
unfortunately feature-incomplete: it provides only a subset of the most
commonly used operations. For example, |HCatCreateTableDesc| maps only a
subset of |CreateTableDesc|; properties like |storeAsSubDirectories|,
|skewedColNames| and |skewedColValues| are missing. It's also impossible
to alter table properties via the HCatalog API (Spark SQL uses this to
implement the |ANALYZE| command). The |hcat| CLI tool provides the
features missing from the HCatalog API by falling back to the raw
Metastore API, and is structurally similar to the old Hive CLI.
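To make the gap concrete, here is roughly what table creation looks like through the HCatalog Java API. This is an uncompiled sketch based on my reading of the 0.12-era API, so class and method names may be slightly off, and the table/column names are made up:

```java
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hive.hcatalog.api.HCatClient;
import org.apache.hive.hcatalog.api.HCatCreateTableDesc;
import org.apache.hive.hcatalog.data.schema.HCatFieldSchema;

HCatClient client = HCatClient.create(new Configuration());
HCatCreateTableDesc desc = HCatCreateTableDesc
    .create("default", "page_views", Arrays.asList(
        new HCatFieldSchema("url", HCatFieldSchema.Type.STRING, null),
        new HCatFieldSchema("hits", HCatFieldSchema.Type.INT, null)))
    .fileFormat("rcfile")
    // Nothing here corresponds to CreateTableDesc.storeAsSubDirectories,
    // skewedColNames or skewedColValues; the builder doesn't expose them.
    .build();
client.createTable(desc);
```

The builder is pleasant to use, but anything it doesn't expose is simply unreachable without dropping down to the raw Metastore API.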
Second, the HCatalog API itself doesn't ensure compatibility; it's the
Thrift protocol that matters. HCatalog is built directly upon the raw
Metastore API, and speaks the same Metastore Thrift protocol. The
problem we encountered in Spark SQL is that we usually deploy Spark SQL
Hive support with an embedded-mode (for testing) or local-mode
Metastore, which makes us suffer from things like Metastore database
schema changes. If the Hive Metastore Thrift protocol is guaranteed to
be downward compatible, then hopefully we can resort to a remote-mode
Metastore and always depend on the most recent Hive APIs. I had a
glance at the Thrift protocol version handling code in Hive, and it
seems that downward compatibility is not an issue. However, I didn't
find any official documentation about Thrift protocol compatibility.
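To illustrate why the wire protocol, rather than the Java API, is what can carry compatibility: Thrift tags every field with a numeric id, and a reader skips ids it doesn't recognize, so a newer server can add optional fields without breaking older clients. Below is a toy simulation of that mechanism (plain Java maps standing in for the real Thrift encoding; the field names are made up):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of Thrift's tagged-field encoding: every field is written
// as a (numeric id, value) pair, and a reader simply skips ids it does
// not know. This is why adding new optional fields to a Metastore
// struct need not break older clients.
class TaggedFields {
    // "Wire format" from a newer server: ordered map from field id to value.
    static Map<Integer, String> encodeNewServer() {
        Map<Integer, String> wire = new LinkedHashMap<>();
        wire.put(1, "default");        // dbName (known to old clients)
        wire.put(2, "page_views");     // tableName (known to old clients)
        wire.put(3, "EXTERNAL=true");  // field added in a later version
        return wire;
    }

    // An "old client" that only understands field ids 1 and 2.
    static String decodeOldClient(Map<Integer, String> wire) {
        String db = null, table = null;
        for (Map.Entry<Integer, String> f : wire.entrySet()) {
            switch (f.getKey()) {
                case 1: db = f.getValue(); break;
                case 2: table = f.getValue(); break;
                default: break; // unknown field id: skip it
            }
        }
        return db + "." + table;
    }

    public static void main(String[] args) {
        // The old client still decodes the fields it knows about.
        System.out.println(decodeOldClient(encodeNewServer()));
    }
}
```

The real question, of course, is whether Hive commits to never reusing or changing the meaning of existing field ids, which is exactly what I couldn't find documented.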
That said, hopefully in the future we can depend only on the most
recent Hive release and remove the Hive shim layer introduced in branch
1.2. Users who run exactly the same Hive version as Spark SQL can use
either a remote or a local/embedded Metastore, while users who want to
interact with existing legacy Hive clusters have to set up a remote
Metastore and let the Thrift protocol handle compatibility.
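Concretely, pointing at a remote-mode Metastore is just a matter of configuration (the hostname below is a placeholder); when |hive.metastore.uris| is unset, Hive falls back to a local/embedded Metastore backed directly by a JDBC database:

```xml
<!-- hive-site.xml: use a remote-mode Metastore over Thrift -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
</property>
```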
— Cheng
On 11/22/14 6:51 AM, Zhan Zhang wrote:
Spark and Hive integration is a very nice feature, but I am wondering
what the long-term roadmap is for Spark's integration with Hive. Both
projects are undergoing fast improvement and change. Currently, my
understanding is that the Spark SQL Hive support relies on the Hive
metastore and basic parser to operate, and the Thrift server intercepts
Hive queries and runs them on Spark's own engine.
With every release of Hive, significant effort is needed on the Spark
side to support it.
For the metastore part, we could possibly replace it with HCatalog. But
given the dependency of other parts on Hive, e.g., the metastore and
the Thrift server, HCatalog may not be able to help much.
Does anyone have any insight or idea in mind?
Thanks.
Zhan Zhang
--
View this message in context:
http://apache-spark-developers-list.1001551.n3.nabble.com/How-spark-and-hive-integrate-in-long-term-tp9482.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.