Jianshi Huang created SPARK-4760:
------------------------------------
Summary: "ANALYZE TABLE table COMPUTE STATISTICS noscan" fails to
estimate table size for tables created from Parquet files
Key: SPARK-4760
URL: https://issues.apache.org/jira/browse/SPARK-4760
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.2.0, 1.3.0
Reporter: Jianshi Huang
In an older Spark build from around Oct. 12, I was able to use
ANALYZE TABLE table COMPUTE STATISTICS noscan
to get an estimated table size, which is important for optimizing joins. (I'm
joining 15 small dimension tables, so this is crucial for me.)
In more recent Spark builds, it fails to estimate the table size unless I
remove "noscan".
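For reference, here's a minimal sketch of the steps that reproduce this for me (table name and schema are just placeholders; any external table backed by Parquet files should do):

CREATE EXTERNAL TABLE parquet_t (id INT) STORED AS PARQUET LOCATION '/path/to/parquet';
ANALYZE TABLE parquet_t COMPUTE STATISTICS noscan;
DESC EXTENDED parquet_t;

In recent builds, totalSize in the table parameters stays 0 after the noscan variant; dropping "noscan" produces a size estimate but forces a scan.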
Here are the statistics I got using DESC EXTENDED:
old:
parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1417763591, totalSize=56166}
new:
parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892,
COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1}
I've also tried turning off spark.sql.hive.convertMetastoreParquet in my
spark-defaults.conf, and the result is unaffected (in both versions).
It looks like the Parquet support in the new Hive version (0.13.1) is broken?
Jianshi