Hi Sparkers,

hoping for insight here:
I'm running a simple "describe mytable", where mytable is a partitioned
Hive table.

Spark produces the following times:

Query 1 of 1, Rows read: 50, Elapsed time (seconds) - Total: 73.02,
SQL query: 72.831, Reading results: 0.189


Whereas Hive over the same metastore shows:

Query 1 of 1, Rows read: 47, Elapsed time (seconds) - Total: 0.44, SQL
query: 0.204, Reading results: 0.236


I suspect the metastore because the Thriftserver couldn't start up at all
until I increased

hive.metastore.client.socket.timeout to 600
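For reference, this is roughly how I set it (a hive-site.xml fragment on the
Spark side; the property name is standard, but whether your Hive version
expects a bare number of seconds or a value with a unit suffix like "600s"
depends on the version):

```xml
<!-- hive-site.xml visible to the Spark Thriftserver -->
<property>
  <name>hive.metastore.client.socket.timeout</name>
  <!-- seconds; newer Hive releases also accept "600s" -->
  <value>600</value>
</property>
```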


Why would metastore access from Spark's Thriftserver be so much worse than
from Hive?


The issue is pretty urgent for me as I ran into this problem during a push
to a production cluster (the QA metastore table is smaller, and that
cluster didn't show this problem).


Is there a known issue with metastore access? I only see
https://issues.apache.org/jira/browse/SPARK-5923, but I'm using Postgres. We
are upgrading from Shark, and both Hive and Shark process this a lot faster.


Describe table in itself is not a critical query for me, but I am seeing a
performance hit in other queries as well, and I suspect the metastore
interaction (e.g.
https://www.mail-archive.com/user@spark.apache.org/msg26242.html).
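In case it helps with triage: one way I've been trying to rule the metastore
database itself in or out is to time the partition lookup directly against
Postgres. This is a sketch against the standard metastore schema (quoted
uppercase table/column names on Postgres; 'mytable' stands in for my real
table name):

```sql
-- Count partitions for the table straight from the metastore DB.
-- If this is fast, the bottleneck is likely client-side, not Postgres.
\timing on
SELECT count(*)
FROM "PARTITIONS" p
JOIN "TBLS" t ON p."TBL_ID" = t."TBL_ID"
WHERE t."TBL_NAME" = 'mytable';
```

In my case this kind of direct query returns quickly, which is why I'm
pointing at the Thriftserver's metastore client rather than the database.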
