[
https://issues.apache.org/jira/browse/ORC-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17799079#comment-17799079
]
Yiqun Zhang commented on ORC-1553:
----------------------------------
[~neopaf] Are you sure statistics are present for 100% of the row groups?
{code:java}
TRACE org.apache.orc.impl.RecordReaderImpl: Stats = numberOfValues: 0
stringStatistics {
}
hasNull: false
{code}
This looks like default statistics that were filled in. I made several
attempts, and I don't think it is possible to use the exposed API to construct
column statistics with numberOfValues = 0 and hasNull = false.
[https://github.com/apache/orc/blob/ede42277e10486e4885ce8f99facd7d194a79498/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L83-L87]
[https://github.com/apache/orc/blob/ede42277e10486e4885ce8f99facd7d194a79498/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L1184-L1187]
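To illustrate why statistics with numberOfValues = 0 and hasNull = false could end up as YES_NO_NULL, here is a minimal, self-contained sketch. It is not the ORC implementation: the class RowGroupStatsSketch, the Stats holder, the Truth enum and the evaluateEquals and mustRead helpers are all hypothetical names, and the rule "zero values and no nulls means the statistics carry no information" is an assumption made only for this example.

```java
public class RowGroupStatsSketch {
    // Hypothetical stand-in for the truth values seen in the trace; not the ORC enum.
    enum Truth { YES_NO, YES_NO_NULL, NO, NO_NULL, NULL }

    // Hypothetical holder for the fields printed in the trace, plus min/max.
    static final class Stats {
        final long numberOfValues;
        final boolean hasNull;
        final String min;
        final String max;

        Stats(long numberOfValues, boolean hasNull, String min, String max) {
            this.numberOfValues = numberOfValues;
            this.hasNull = hasNull;
            this.min = min;
            this.max = max;
        }
    }

    // Evaluate "column = literal" against row group statistics.
    static Truth evaluateEquals(Stats s, String literal) {
        if (s.numberOfValues == 0) {
            // Assumption: 0 values with hasNull = true means "only nulls", so the
            // predicate can only be NULL. 0 values with hasNull = false looks like
            // unpopulated defaults, so nothing can be ruled out.
            return s.hasNull ? Truth.NULL : Truth.YES_NO_NULL;
        }
        boolean outside = literal.compareTo(s.min) < 0 || literal.compareTo(s.max) > 0;
        if (outside) {
            return s.hasNull ? Truth.NO_NULL : Truth.NO;
        }
        return s.hasNull ? Truth.YES_NO_NULL : Truth.YES_NO;
    }

    // A row group has to be read only if the predicate might be true.
    static boolean mustRead(Truth t) {
        return t == Truth.YES_NO || t == Truth.YES_NO_NULL;
    }

    public static void main(String[] args) {
        Stats looksDefault = new Stats(0, false, null, null);
        Truth t = evaluateEquals(looksDefault, "71231231212");
        System.out.println(t + " -> mustRead = " + mustRead(t));
    }
}
```

Under that assumption the row group is included because the statistics are treated as unknown, not because they claim matching values exist. Whether the real reader follows this rule should be checked against the two RecordReaderImpl links above.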
> Reading information from Row group, where there are 0 records of SArg column
> ----------------------------------------------------------------------------
>
> Key: ORC-1553
> URL: https://issues.apache.org/jira/browse/ORC-1553
> Project: ORC
> Issue Type: Bug
> Affects Versions: 1.9.2
> Reporter: Alexander Petrossian (PAF)
> Priority: Major
>
> We created an .orc file using the Apache ORC library; I cannot provide a
> reproducible way to create such a file.
> We have statistics for 100% of row groups, checked with orc dump.
> But when we search that file we see very strange behavior:
> {code}
> TRACE org.apache.orc.impl.RecordReaderImpl: Stats = numberOfValues: 0
> stringStatistics {
> }
> hasNull: false
> TRACE org.apache.orc.impl.RecordReaderImpl: Setting (EQUALS value
> 71231231212) to YES_NO_NULL
> DEBUG org.apache.orc.impl.RecordReaderImpl: Row group 340000 to 349999 is
> included.
> {code}
> If there are 0 values according to the existing statistics, there is obviously
> no need to read that row group.
> And yet we get a YES_NO_NULL decision, which forces inclusion of that row group
> in subsequent operations; this is meaningless and bad for performance.