Jianshi Huang created SPARK-4760:
------------------------------------

             Summary: "ANALYZE TABLE table COMPUTE STATISTICS noscan" failed 
estimating table size for tables created from Parquet files
                 Key: SPARK-4760
                 URL: https://issues.apache.org/jira/browse/SPARK-4760
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.2.0, 1.3.0
            Reporter: Jianshi Huang


In an older Spark version built around Oct. 12, I was able to use 

  ANALYZE TABLE table COMPUTE STATISTICS noscan

to get an estimated table size, which is important for optimizing joins. (I'm 
joining 15 small dimension tables, so this is crucial for me.)

In more recent Spark builds, it fails to estimate the table size unless I 
remove "noscan".
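
For reference, a minimal reproduction sketch (the table name and Parquet 
path here are hypothetical, not from my actual job):

```sql
-- Hypothetical external table backed by existing Parquet files
CREATE EXTERNAL TABLE dim_example (id INT, name STRING)
STORED AS PARQUET
LOCATION '/path/to/parquet';

-- In recent builds this leaves totalSize=0 in the metastore:
ANALYZE TABLE dim_example COMPUTE STATISTICS noscan;

-- Removing "noscan" works, but forces a full scan of the data:
ANALYZE TABLE dim_example COMPUTE STATISTICS;

-- Inspect the recorded statistics:
DESC EXTENDED dim_example;
```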

Here are the statistics I got using DESC EXTENDED:

old:
parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1417763591, totalSize=56166}

new:
parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892, 
COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1}

I've also tried turning off spark.sql.hive.convertMetastoreParquet in my 
spark-defaults.conf, and the result is unaffected (in both versions).

It looks like the Parquet support in the new Hive (0.13.1) is broken?


Jianshi




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
