Hello,

(I just asked the same question on Stack Overflow; I am not sure where I
will get an answer faster. If I do, I will update the other side.)

I am practicing HDFS, Hive, and Spark. I installed Hadoop 3.3.6, Hive 3.1.3,
and Spark 3.4.2, but I am unable to run any SQL in the pyspark shell. The
error I get is:

org.apache.thrift.TApplicationException: Invalid method name: 'get_database'

According to the Spark 3.4.2 documentation, Spark uses Hive 2.3.9 as the
default metastore version, but it can be configured to use 3.x.x. So I added
"--conf 'spark.sql.hive.metastore.version=3.1.3' --conf
'spark.sql.hive.metastore.jars=maven'" to my pyspark startup script, but it
still fails with the same get_database error.

If I don't specify the metastore configuration parameters (i.e., I just run
"pyspark --conf 'spark.sql.catalogImplementation=hive' --conf
'hive.metastore.uris=thrift://master1:10000'"), Spark creates a metastore
in my current local directory and does not even try to connect to my Hive
server. I wonder why, because I also have hive-site.xml in my
$SPARK_HOME/conf directory, with the following property in it:

<property>
    <name>hive.metastore.uris</name>
    <value>thrift://master1:10000</value>
</property>
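
To sanity-check what the session actually picked up, I read the settings
back from inside pyspark (nothing beyond the standard conf API;
spark.conf.get raises an error for a key that was never set, which is
itself informative):

    >>> spark.conf.get("spark.sql.catalogImplementation")
    >>> spark.conf.get("hive.metastore.uris")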

From a GitHub source-code search, it looks to me like get_database is not
present in Hive 3.x. I also downloaded and tried the Hive 4.0 beta, which
does seem to have get_database, but the problem persists. I would rather
not downgrade Hive to 2.x, since that could cause compatibility issues with
my Hadoop 3.x.

Thanks, James Hsieh
