[ 
https://issues.apache.org/jira/browse/SPARK-19890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-19890:
---------------------------------
    Labels: bulk-closed  (was: )

> Make MetastoreRelation statistics estimation more accurately
> ------------------------------------------------------------
>
>                 Key: SPARK-19890
>                 URL: https://issues.apache.org/jira/browse/SPARK-19890
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Zhan Zhang
>            Priority: Minor
>              Labels: bulk-closed
>
> Currently the MetastoreRelation statistics is retrieved on the analyze phase, 
> and the size is based on the table scope. But for partitioned table, this 
> statistics is not useful as table size may 100x+ larger than the input 
> partition size. As a result, the join optimization techniques is not 
> applicable.
> It would be great if we can postpone the statistics to the optimization phase 
> to get partition information but before physical plan generation phase so 
> that JoinSelection can choose better join methd (broadcast, shuffledjoin, or 
> sortmerjoin).
> Although the metastorerelation does not associated with partitions, but 
> through PhysicalOperation we can get the partition info for the table. 
> Multiple plan can use the same meatstorerelation, but the estimation is still 
> much better than table size. This way, retrieving statistics is 
> straightforward.
> Another possible way is to have a another data structure associating the 
> metastore relation and partitions with the plan to get most accurate 
> estimation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to