[ 
https://issues.apache.org/jira/browse/IMPALA-8205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16772225#comment-16772225
 ] 

Tim Armstrong commented on IMPALA-8205:
---------------------------------------

I don't think there's a reason (aside from the API deficiencies in the HMS) for 
us to write -1s here. By the robustness principle, we should be conservative 
about what we write. By the same principle, systems should be liberal about 
what they accept - I'd suggest just ignoring any stats < 0. Impala has been 
writing these values for a long time, so the metadata is already out there. 
We've also seen various cases of other systems generating bogus stats, e.g. 
setting numRows = 0 for non-empty partitions.
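
A minimal sketch of what "ignoring any stats < 0" could look like on the 
reading side. It assumes the stats arrive as a plain name-to-value mapping; 
sanitize_stats and the dict layout are hypothetical and are not Impala, 
Presto, or HMS code.

```python
from typing import Dict, Optional


def sanitize_stats(raw: Dict[str, int]) -> Dict[str, Optional[int]]:
    """Return a copy of raw where every negative statistic becomes None.

    A negative count can never be a real value, so it is treated as
    "unknown". Non-negative values, including 0, are passed through.
    """
    return {name: (value if value >= 0 else None) for name, value in raw.items()}


# Hypothetical column stats as a reader might see them.
stats = sanitize_stats({"numNulls": 12, "numTrues": 1, "numFalses": -1})
print(stats)
```

A check like this only catches impossible values. It cannot detect a 
plausible-looking but wrong stat such as numRows = 0 on a non-empty partition.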

It seems like we should calculate numTrues and numFalse. There's no specific 
reason not to; someone just has to put the time into it. IMO these stats are a 
bit too cute - why implement histograms as a special case for this one data 
type? Queries that would actually benefit (as opposed to guessing that 50% of 
values are true/false) seem pretty rare. Anyway, it doesn't really matter since 
that ship has sailed.
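
To illustrate the trade-off, here is a rough sketch of how a planner might 
estimate the selectivity of "col = true" among non-null rows, falling back to 
the 50% guess when the counts are missing or invalid. The function name and 
the fallback constant are assumptions for illustration, not how Impala's 
planner is actually written.

```python
from typing import Optional

# Assumed fallback when nothing is known about the boolean column.
DEFAULT_BOOL_SELECTIVITY = 0.5


def true_selectivity(num_trues: Optional[int], num_falses: Optional[int]) -> float:
    """Estimate the fraction of non-null rows where the column is true."""
    counts = (num_trues, num_falses)
    # Missing (None) or negative counts are treated as unknown.
    if any(c is None or c < 0 for c in counts):
        return DEFAULT_BOOL_SELECTIVITY
    total = num_trues + num_falses
    if total == 0:
        # No non-null rows recorded, so there is nothing to base an estimate on.
        return DEFAULT_BOOL_SELECTIVITY
    return num_trues / total


print(true_selectivity(25, 75))   # real counts available
print(true_selectivity(1, -1))    # numFalse is the -1 placeholder
```

The stats only help when the true/false split is far from even, which is the 
reason the benefit looks limited.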

I don't particularly like [~wuchang1989]'s solution because we should be able 
to write out the valid numNulls stats. I don't think we should have to add a 
hack like that to work around Hive's API issues and other systems' robustness 
issues with stats.



> Illegal statistics for numFalse and numTrue
> -------------------------------------------
>
>                 Key: IMPALA-8205
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8205
>             Project: IMPALA
>          Issue Type: Bug
>            Reporter: wuchang
>            Priority: Major
>              Labels: impala, numFalse, numTrue, statistics
>
> When Impala computes statistics, it sets *numFalse = -1* and *numTrue = 1* 
> when the statistic is missing.
> *-1* for *numFalse* breaks some query engines such as Presto, and there is 
> already a PR that reports and hotfixes it: 
> [presto-11859|https://github.com/prestodb/presto/pull/11859]
> *1* for *numTrue* is also unreasonable because we cannot tell whether it is 
> the real numTrue statistic or a missing one.
> Also, *nullCount* previously used -1 to indicate its absence, which also 
> caused problems for Presto. Presto had to add a hotfix for it 
> ([presto-11549|https://github.com/prestodb/presto/pull/11549]). Fortunately, 
> Impala has since fixed that bug.
> These statistics should be set to null when they are absent, instead of -1 
> and 1.



