[ 
https://issues.apache.org/jira/browse/HIVE-15477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15765601#comment-15765601
 ] 

Chao Sun commented on HIVE-15477:
---------------------------------

Thanks for the quick reply [~prasanth_j]! Yes, I'm fine if the existing configs 
can work.

However, I think the issue is that *the compound effect of chained AND 
predicates could be really dramatic*, and it could be pretty common since often 
one has multiple filter conditions in complex data analysis.
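
To make the compounding concrete, here is a minimal sketch in Python. It assumes only the behavior described in this issue (each predicate expression keeps 1/2 of its input rows when column stats are missing); the function name, the {{factor}} parameter, and the table size are hypothetical and are not Hive's actual code.

```python
def estimate_rows(input_rows: int, num_predicates: int, factor: float = 0.5) -> int:
    """Hypothetical estimator: every AND-ed predicate keeps `factor` of its input.

    factor=0.5 mirrors the 1/2-per-predicate assumption described in this issue.
    """
    return max(1, int(input_rows * factor ** num_predicates))


# Made-up fact table size, for illustration only.
fact_rows = 1_000_000_000

for k in (1, 3, 5, 10):
    # With 10 chained predicates the estimate shrinks by a factor of 1024.
    print(k, estimate_rows(fact_rows, k))
```

With ten chained predicates, a billion-row input is estimated at under a million rows, regardless of how selective the filters really are.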

For instance, imagine you have a big fact table *A* with many filters and a 
small dimension table *B*. When stats are correct, one should put *B* on the 
build side. However, due to the incorrect stats on *A*'s branch, the # of rows 
in the "final" stats could be really small, and thus Hive will put *A* on the 
build side and *B* on the probe side, which may then fail because the eventual 
# of output rows from *A*'s branch could be much bigger than the stats 
indicated.
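
The scenario above can be sketched roughly as follows. This is a deliberately simplified model: it assumes the side with fewer estimated rows is chosen as the build side, whereas the real planner works from size estimates and thresholds. All names and numbers here are made up for illustration.

```python
def pick_build_side(est_rows_a: int, est_rows_b: int) -> str:
    """Hypothetical rule: the side with fewer estimated rows is the build side."""
    return "A" if est_rows_a < est_rows_b else "B"


fact_rows = 1_000_000_000  # made-up size of fact table A
dim_rows = 1_000_000       # made-up size of dimension table B
num_filters = 10           # AND-ed predicates on A's branch

# Estimate with the 1/2-per-predicate assumption: 10^9 / 2^10.
estimated_a = int(fact_rows * 0.5 ** num_filters)

# Made-up number of rows that actually survive the filters on A.
actual_a = 400_000_000

print(pick_build_side(estimated_a, dim_rows))  # A: wrong choice from bad stats
print(pick_build_side(actual_a, dim_rows))     # B: choice with accurate stats
```

In this sketch the estimate for *A* drops just below the size of *B*, so *A* is picked as the build side even though far more rows actually flow out of its branch.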

I'm not sure if any of the current configs can help with the above case (besides 
the mapjoin hint, which is outdated). Also, I think the issue is not limited to 
the mapjoin case, but affects any query with filters. The 
{{hive.stats.filter.predicate.factor}} gives users an option to control 
the degree of optimism in the filter estimate.

Any further thoughts?


> Provide options to adjust filter stats when column stats are not available
> --------------------------------------------------------------------------
>
>                 Key: HIVE-15477
>                 URL: https://issues.apache.org/jira/browse/HIVE-15477
>             Project: Hive
>          Issue Type: Bug
>          Components: Statistics
>    Affects Versions: 2.2.0
>            Reporter: Chao Sun
>            Assignee: Chao Sun
>         Attachments: HIVE-15477.1.patch
>
>
> Currently when column stats are not available, Hive will assume the "worst" 
> case by setting the # of output rows to be 1/2 of the # of input rows, for 
> each predicate expression. This could be inaccurate, especially in the 
> presence of multiple predicates chained by AND. We have found that in some 
> cases this can cause a map join to pick the wrong ordering and thus fail with 
> memory issues.
> One suggestion is to provide a config (such as {{hive.stats.filter.factor}}) 
> that can be used to control the percentage of rows emitted by a predicate 
> expression. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
