[ 
https://issues.apache.org/jira/browse/HIVE-15477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15765724#comment-15765724
 ] 

Prasanth Jayachandran commented on HIVE-15477:
----------------------------------------------

Agreed that the compound effect of chained AND predicates can be bad when there
are absolutely no stats. But the primary issue is that when we don't have stats,
we operate on the file size and estimate the number of rows from it (which can
be way off because of compression). On top of that, we introduce
"join_key_column IS NOT NULL" predicates. This can cause the data size to be
dramatically mis-estimated, as we don't have knowledge of the compression or any
statistics. That's the primary reason for adding the configs initially.
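
To illustrate how far off a size-based row estimate can be, here is a minimal
sketch in Python. It is not Hive's actual code: the function name, the average
row size, and the 10x compression ratio are all assumptions made up for the
example.

```python
def estimate_rows_from_size(size_bytes: int, avg_row_size_bytes: int) -> int:
    """Guess a row count from a byte size alone (hypothetical helper, not a Hive API)."""
    return max(1, size_bytes // avg_row_size_bytes)


# Assumed numbers, for illustration only.
on_disk_size = 100_000_000      # compressed file size in bytes
avg_row_size = 100              # assumed uncompressed bytes per row
compression_ratio = 10          # assumed; unknown to the planner without stats

# What the planner would guess when it only sees the compressed file size.
estimated_rows = estimate_rows_from_size(on_disk_size, avg_row_size)

# What the table would hold if the data really compresses 10x.
actual_rows = estimate_rows_from_size(on_disk_size * compression_ratio, avg_row_size)

print(estimated_rows, actual_rows)
```

Under these assumed numbers the estimate is off by the full compression ratio,
before any filter predicate is applied.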

There are two ends of the spectrum for cardinality estimation. If the column
has an NDV of 1, all rows are selected; if NDV == numRows, numRows/2 will be
estimated to be selected, which is not bad. (When column stats are present, the
estimate is numRows/NDV, which is also incorrect when there is skew.) With
chained AND predicates we filter more rows by definition, so the estimate gets
worse without stats.
So my point is that even with configs we are going to make wrong estimations :)
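
The two estimation rules discussed above can be sketched as follows. This is a
simplified Python illustration, not Hive's implementation: both function names
are hypothetical, and the "factor" parameter only mirrors the
hive.stats.filter.factor config proposed in the issue description (0.5
reproduces the current numRows/2 behaviour).

```python
def estimate_without_col_stats(num_rows: int, num_predicates: int,
                               factor: float = 0.5) -> int:
    """Apply a fixed selectivity per predicate in an AND chain (hypothetical helper)."""
    estimate = float(num_rows)
    for _ in range(num_predicates):
        estimate *= factor          # each predicate keeps `factor` of the rows
    return max(1, int(estimate))


def estimate_with_ndv(num_rows: int, ndv: int) -> int:
    """Uniform-distribution estimate numRows/NDV; ignores skew (hypothetical helper)."""
    return max(1, num_rows // ndv)


num_rows = 1_000_000

# One predicate vs. four predicates chained by AND, with no column stats.
print(estimate_without_col_stats(num_rows, 1))
print(estimate_without_col_stats(num_rows, 4))

# The two ends of the spectrum when NDV is known.
print(estimate_with_ndv(num_rows, 1))
print(estimate_with_ndv(num_rows, num_rows))
```

With the default factor, four chained predicates shrink the estimate to 1/16 of
the input, regardless of how selective the predicates really are.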

I would rather prevent the "join_key_column IS NOT NULL" predicate from
affecting the stats in the worst case. Preferably, we want the optimizer, rather
than users, to take care of such scenarios. IMHO we should also consider
ignoring the "join_key_column IS NOT NULL" predicate that is introduced by Hive.
I'm not sure which optimizer introduces this predicate. Maybe
[~jcamachorodriguez] has an idea?

I am not against adding another config, but it would be better if the optimizer
could make better decisions for worst-case scenarios with or without the config.



> Provide options to adjust filter stats when column stats are not available
> --------------------------------------------------------------------------
>
>                 Key: HIVE-15477
>                 URL: https://issues.apache.org/jira/browse/HIVE-15477
>             Project: Hive
>          Issue Type: Bug
>          Components: Statistics
>    Affects Versions: 2.2.0
>            Reporter: Chao Sun
>            Assignee: Chao Sun
>         Attachments: HIVE-15477.1.patch
>
>
> Currently when column stats are not available, Hive will assume the "worst" 
> case by setting the # of output rows to be 1/2 of the # of input rows, for 
> each predicate expression. This could be inaccurate, especially in the 
> presence of multiple predicates chained by AND. We have found that in some
> cases this can cause a map join to pick the wrong ordering and thus fail with
> a memory issue.
> One suggestion is to provide a config (such as {{hive.stats.filter.factor}}) 
> that can be used to control the percentage of rows emitted by a predicate 
> expression. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
