Besides I have tried ANALYZE statement. It has no use cause I need the single
partition but get the total table size by hive parameter 'totalSize' or
'rawSize' and so on
Hi, guys:
I'm using Spark1.6.2.
There are two tables and the small one is a partitioned parquet table;
The total size of the small table is 1000M but each partition only 1M;
When I set spark.sql.autoBroadcastJoinThreshold to 50m and join the two
tables with single partition, I get the SortMergeJoin physical plan.
I have made some try and it has something to do with the partition pruning:
1. check the physical plan, and all of the partitions of the small table
are added in.
It seems like https://issues.apache.org/jira/browse/SPARK-16980
2. set spark.sql.hive.convertMetastoreParquet=false
The pruning is success, but still get SortMergeJoin because the code
HiveMetastoreCatalog.scala
@transient override lazy val statistics: Statistics = Statistics(
sizeInBytes = {
val totalSize = hiveQlTable.getParameters.get(StatsSetupConst.TOTAL_SIZE)
val rawDataSize =
hiveQlTable.getParameters.get(StatsSetupConst.RAW_DATA_SIZE) The total size
of the table not the single partition.
How can I fix this without patches? Or Is there a patch for SPARK1.6 about
SPARK-16980.
best regards!
Jerry