[ 
https://issues.apache.org/jira/browse/HIVE-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15860918#comment-15860918
 ] 

Xuefu Zhang commented on HIVE-15489:
------------------------------------

Thanks for working on this, [~csun]! I made one pass over this patch and have 
the following thoughts:
1. The new configuration might have a better name. "hive.spark.use.ts.stats" 
seems a little too general. Please consider a more specific name, something 
like "hive_on_spark.use.file.size.for.mapjoin". Very minor though.
2. For the new property, we probably want to default it to the old behavior 
when checking in. Maybe we can have some test cases run with this new 
configuration on.
3. If the join op isn't coming directly from a table scan, I see we are still 
using operator stats to decide on mapjoin. This can still cause the issue of 
inaccurate estimation, right? Should we just not convert it to a map join in 
such a case?
4. There seem to be some test failures in the above run. Are they related?
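
To illustrate point 3, here is a minimal sketch of the stricter rule being 
suggested: a join is converted to a map join only when every small-side branch 
comes directly from a table scan and the combined table scan sizes fit under a 
threshold. All class, field, and method names below are hypothetical and do 
not correspond to the patch or to Hive's actual optimizer code.

```java
import java.util.List;

class MapJoinSizeCheck {

    /** Hypothetical stand-in for one small-side input branch of a join. */
    static final class Branch {
        final boolean directFromTableScan; // true if the branch is fed directly by a table scan
        final long tableScanSizeBytes;     // size taken from the table scan stats (assumed available)

        Branch(boolean directFromTableScan, long tableScanSizeBytes) {
            this.directFromTableScan = directFromTableScan;
            this.tableScanSizeBytes = tableScanSizeBytes;
        }
    }

    /**
     * Returns true only if all small-side branches come directly from a table
     * scan and their total size stays within thresholdBytes. A branch that
     * does not come directly from a table scan disables the conversion
     * instead of falling back to operator stats.
     */
    static boolean canConvertToMapJoin(List<Branch> smallSides, long thresholdBytes) {
        long total = 0;
        for (Branch b : smallSides) {
            if (!b.directFromTableScan) {
                return false; // no reliable size available: do not convert
            }
            total += b.tableScanSizeBytes;
            if (total > thresholdBytes) {
                return false; // small side too large for a map join
            }
        }
        return true;
    }
}
```

The design choice here is to refuse the conversion rather than guess: it is 
more conservative than using operator stats, but it avoids the inaccurate 
estimation discussed above.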

> Alternatively use table scan stats for HoS
> ------------------------------------------
>
>                 Key: HIVE-15489
>                 URL: https://issues.apache.org/jira/browse/HIVE-15489
>             Project: Hive
>          Issue Type: Improvement
>          Components: Spark, Statistics
>    Affects Versions: 2.2.0
>            Reporter: Chao Sun
>            Assignee: Chao Sun
>         Attachments: HIVE-15489.1.patch, HIVE-15489.2.patch, 
> HIVE-15489.3.patch, HIVE-15489.4.patch, HIVE-15489.wip.patch
>
>
> For MapJoin in HoS, we should provide an option to only use stats in the TS 
> rather than the populated stats in each of the join branches. This could be 
> pretty conservative but more reliable.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
