[
https://issues.apache.org/jira/browse/HIVE-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858665#comment-15858665
]
Chao Sun commented on HIVE-15489:
---------------------------------
One issue with the current approach is that the JOIN operator we are looking
at could be impacted by upstream joins/aggregations:
{code}
      M1   M2
        \ /
(JOIN 1) R1     M3
          \    /
           \  R2
            \ /
             R3 (JOIN 2)
{code}
Here there are multiple reduce phases before getting to {{JOIN 2}}, and each of
them could change the data size a lot, so the TS stats may be far from the
actual size of the input to {{JOIN 2}}.
To minimize this inaccuracy, I propose that *we should only use TS stats if
there is no RS between the JOIN and all roots reachable from it.*
In the above, {{JOIN 1}} satisfies the condition while {{JOIN 2}} does not.
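A rough sketch of the proposed check, for illustration only and not the code in the attached patches. It works on the stage graph from the diagram rather than on Hive's operator tree, and it assumes that a reduce stage boundary stands in for an RS and that the shuffle feeding the join's own stage is not counted. The {{Stage}} class and {{can_use_ts_stats}} function are hypothetical names, and the edges (M1, M2 -> R1; M3 -> R2; R1, R2 -> R3) are one reading of the diagram.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Stage:
    """Hypothetical stand-in for one map (M*) or reduce (R*) phase of the plan."""
    name: str
    is_reduce: bool
    parents: List["Stage"] = field(default_factory=list)


def can_use_ts_stats(join_stage: Stage) -> bool:
    """Return True if no reduce phase lies upstream of the stage hosting the join.

    Assumption: the join's own stage is not counted, only the stages that
    feed it, directly or transitively, back to the roots.
    """
    stack = list(join_stage.parents)
    seen = set()
    while stack:
        stage = stack.pop()
        if stage.name in seen:
            continue
        seen.add(stage.name)
        if stage.is_reduce:
            # An upstream shuffle may have changed the data size, so the
            # table scan stats are no longer a safe estimate.
            return False
        stack.extend(stage.parents)
    return True


# The plan from the diagram, with the edges assumed as described above.
m1, m2, m3 = Stage("M1", False), Stage("M2", False), Stage("M3", False)
r1 = Stage("R1", True, [m1, m2])  # hosts JOIN 1
r2 = Stage("R2", True, [m3])
r3 = Stage("R3", True, [r1, r2])  # hosts JOIN 2

print(can_use_ts_stats(r1))  # JOIN 1
print(can_use_ts_stats(r3))  # JOIN 2
```

Under these assumptions the check passes for {{JOIN 1}}, whose inputs come straight from M1 and M2, and fails for {{JOIN 2}}, which sits below R1 and R2.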
> Alternatively use table scan stats for HoS
> ------------------------------------------
>
> Key: HIVE-15489
> URL: https://issues.apache.org/jira/browse/HIVE-15489
> Project: Hive
> Issue Type: Improvement
> Components: Spark, Statistics
> Affects Versions: 2.2.0
> Reporter: Chao Sun
> Assignee: Chao Sun
> Attachments: HIVE-15489.1.patch, HIVE-15489.2.patch,
> HIVE-15489.wip.patch
>
>
> For MapJoin in HoS, we should provide an option to only use stats in the TS
> rather than the populated stats in each of the join branches. This could be
> pretty conservative but more reliable.