[
https://issues.apache.org/jira/browse/ORC-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17798969#comment-17798969
]
Alexander Petrossian (PAF) edited comment on ORC-1553 at 12/20/23 12:27 PM:
----------------------------------------------------------------------------
Quick analysis shows that the problem was introduced in ORC-1075, which handled
the case where a row group has *no statistics* at all and hence must be scanned
(and that is a good thing).
However, the check introduced there assumes that statistics can never be
present on disk with both 0 records AND hasNull=false.
Again, I don't know exactly how my colleagues managed to create an .orc file
with such statistics. They say it was done through the Apache ORC library at
the low level, with Spark at the top level.
Can we please treat this as a separate issue: rather than trying to fix the
strange statistics storage, change the ORC-1075 approach to a more thorough
check that statistics are indeed MISSING from the file. In my case they are
PRESENT.
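To illustrate the distinction I am asking for, here is a minimal sketch. It
does not use the real ORC classes: Stats, Decision and decide are hypothetical
names made up for this example, and the logic assumes a predicate such as
EQUALS against a non-null literal. It is not the actual RecordReaderImpl code,
only the shape of the check.

```java
import java.util.Optional;

public class RowGroupStatsCheck {

    /** Hypothetical, simplified stand-in for one column's row-group statistics. */
    static final class Stats {
        final long numberOfValues;
        final boolean hasNull;

        Stats(long numberOfValues, boolean hasNull) {
            this.numberOfValues = numberOfValues;
            this.hasNull = hasNull;
        }
    }

    /** Hypothetical outcome of the pre-check done before predicate evaluation. */
    enum Decision { SCAN, SKIP, EVALUATE }

    /**
     * Assumes an EQUALS-style predicate on a non-null literal.
     * An empty Optional models statistics that are MISSING from the file.
     */
    static Decision decide(Optional<Stats> stats) {
        if (!stats.isPresent()) {
            // Statistics are missing: nothing is known, so the row group
            // must be scanned (the ORC-1075 situation).
            return Decision.SCAN;
        }
        Stats s = stats.get();
        if (s.numberOfValues == 0) {
            // Statistics are present and report 0 non-null values:
            // no value can equal the literal, so the row group can be skipped.
            return Decision.SKIP;
        }
        // Otherwise fall through to the normal min/max evaluation.
        return Decision.EVALUATE;
    }

    public static void main(String[] args) {
        System.out.println(decide(Optional.empty()));
        System.out.println(decide(Optional.of(new Stats(0, false))));
        System.out.println(decide(Optional.of(new Stats(10000, false))));
    }
}
```

The point is only that "missing" and "present with 0 values" are two different
inputs and should lead to different decisions; how missing statistics are
actually detected inside ORC is left open here.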
> Reading information from Row group, where there are 0 records of SArg column
> ----------------------------------------------------------------------------
>
> Key: ORC-1553
> URL: https://issues.apache.org/jira/browse/ORC-1553
> Project: ORC
> Issue Type: Bug
> Affects Versions: 1.9.2
> Reporter: Alexander Petrossian (PAF)
> Priority: Major
>
> We created an .orc file using the Apache ORC library; I cannot provide a
> reproducible way to create such a file.
> We have statistics for 100% of row groups, checked with orc dump.
> But when we search that file, we see very strange behavior:
> {code}
> TRACE org.apache.orc.impl.RecordReaderImpl: Stats = numberOfValues: 0
> stringStatistics {
> }
> hasNull: false
> TRACE org.apache.orc.impl.RecordReaderImpl: Setting (EQUALS value
> 71231231212) to YES_NO_NULL
> DEBUG org.apache.orc.impl.RecordReaderImpl: Row group 340000 to 349999 is
> included.
> {code}
> If there are 0 values according to the existing statistics, there is obviously
> no need to read that row group.
> And yet we get a YES_NO_NULL decision, which forces inclusion of that row group
> in subsequent operations, which is meaningless and bad for performance.