[
https://issues.apache.org/jira/browse/SPARK-19890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon updated SPARK-19890:
---------------------------------
Labels: bulk-closed (was: )
> Make MetastoreRelation statistics estimation more accurately
> ------------------------------------------------------------
>
> Key: SPARK-19890
> URL: https://issues.apache.org/jira/browse/SPARK-19890
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Zhan Zhang
> Priority: Minor
> Labels: bulk-closed
>
> Currently the MetastoreRelation statistics is retrieved on the analyze phase,
> and the size is based on the table scope. But for partitioned table, this
> statistics is not useful as table size may 100x+ larger than the input
> partition size. As a result, the join optimization techniques is not
> applicable.
> It would be great if we can postpone the statistics to the optimization phase
> to get partition information but before physical plan generation phase so
> that JoinSelection can choose better join methd (broadcast, shuffledjoin, or
> sortmerjoin).
> Although the metastorerelation does not associated with partitions, but
> through PhysicalOperation we can get the partition info for the table.
> Multiple plan can use the same meatstorerelation, but the estimation is still
> much better than table size. This way, retrieving statistics is
> straightforward.
> Another possible way is to have a another data structure associating the
> metastore relation and partitions with the plan to get most accurate
> estimation.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]