[
https://issues.apache.org/jira/browse/ORC-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17799283#comment-17799283
]
Alexander Petrossian (PAF) edited comment on ORC-1553 at 12/21/23 7:04 AM:
---------------------------------------------------------------------------
I am :(
{noformat}
entry = {OrcProto$RowIndexEntry@2550} "positions: 0\npositions: 0\npositions:
228\nstatistics {\n numberOfValues: 0\n stringStatistics {\n }\n hasNull:
false\n}\n"
bitField0_ = 1
positions_ = {LongArrayList@2580} size = 3
positionsMemoizedSerializedSize = -1
statistics_ = {OrcProto$ColumnStatistics@2581} "numberOfValues:
0\nstringStatistics {\n}\nhasNull: false\n"
bitField0_ = 521
numberOfValues_ = 0
intStatistics_ = null
doubleStatistics_ = null
stringStatistics_ = {OrcProto$StringStatistics@2585} ""
bitField0_ = 0
minimum_ = ""
maximum_ = ""
sum_ = 0
lowerBound_ = ""
upperBound_ = ""
memoizedIsInitialized = -1
unknownFields = {UnknownFieldSet@2582} ""
memoizedSize = -1
memoizedHashCode = 0
bucketStatistics_ = null
decimalStatistics_ = null
dateStatistics_ = null
binaryStatistics_ = null
timestampStatistics_ = null
hasNull_ = false
bytesOnDisk_ = 0
collectionStatistics_ = null
memoizedIsInitialized = -1
unknownFields = {UnknownFieldSet@2582} ""
memoizedSize = -1
memoizedHashCode = 0
memoizedIsInitialized = -1
unknownFields = {UnknownFieldSet@2582} ""
memoizedSize = -1
memoizedHashCode = 0
{noformat}
Thanks a lot for looking into this!
I'm open to any experiments.
was (Author: neopaf):
I am :(
!Снимок экрана 2023-12-21 в 10.00.23.png!
{noformat}
entry = {OrcProto$RowIndexEntry@2550} "positions: 0\npositions: 0\npositions:
228\nstatistics {\n numberOfValues: 0\n stringStatistics {\n }\n hasNull:
false\n}\n"
bitField0_ = 1
positions_ = {LongArrayList@2580} size = 3
positionsMemoizedSerializedSize = -1
statistics_ = {OrcProto$ColumnStatistics@2581} "numberOfValues:
0\nstringStatistics {\n}\nhasNull: false\n"
bitField0_ = 521
numberOfValues_ = 0
intStatistics_ = null
doubleStatistics_ = null
stringStatistics_ = {OrcProto$StringStatistics@2585} ""
bitField0_ = 0
minimum_ = ""
maximum_ = ""
sum_ = 0
lowerBound_ = ""
upperBound_ = ""
memoizedIsInitialized = -1
unknownFields = {UnknownFieldSet@2582} ""
memoizedSize = -1
memoizedHashCode = 0
bucketStatistics_ = null
decimalStatistics_ = null
dateStatistics_ = null
binaryStatistics_ = null
timestampStatistics_ = null
hasNull_ = false
bytesOnDisk_ = 0
collectionStatistics_ = null
memoizedIsInitialized = -1
unknownFields = {UnknownFieldSet@2582} ""
memoizedSize = -1
memoizedHashCode = 0
memoizedIsInitialized = -1
unknownFields = {UnknownFieldSet@2582} ""
memoizedSize = -1
memoizedHashCode = 0
{noformat}
Thanks a lot for looking into this.
I'm open for any experiments.
> Reading information from Row group, where there are 0 records of SArg column
> ----------------------------------------------------------------------------
>
> Key: ORC-1553
> URL: https://issues.apache.org/jira/browse/ORC-1553
> Project: ORC
> Issue Type: Bug
> Affects Versions: 1.9.2
> Reporter: Alexander Petrossian (PAF)
> Priority: Major
> Attachments: Снимок экрана 2023-12-21 в 10.00.23.png
>
>
> We created an .orc file using the Apache ORC library; I cannot provide a
> reproducible way to create such a file.
> Statistics are present for 100% of row groups (checked with orc dump).
> But when we search that file, we see very strange behavior:
> {code}
> TRACE org.apache.orc.impl.RecordReaderImpl: Stats = numberOfValues: 0
> stringStatistics {
> }
> hasNull: false
> TRACE org.apache.orc.impl.RecordReaderImpl: Setting (EQUALS value
> 71231231212) to YES_NO_NULL
> DEBUG org.apache.orc.impl.RecordReaderImpl: Row group 340000 to 349999 is
> included.
> {code}
> If the existing statistics say there are 0 values, there is obviously
> no need to read that row group.
> And yet we get a YES_NO_NULL decision, which forces inclusion of that row group
> in the subsequent operation; that is pointless and bad for performance.
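To make the expected behavior concrete, here is a minimal standalone sketch of the pruning rule described above. It is not the RecordReaderImpl code: the class RowGroupPruningSketch, its reduced TruthValue enum, and the evaluateEquals helper are hypothetical names used only for illustration, and the rule itself (no values and no nulls means an EQUALS predicate cannot match) is an assumption about what the reader ought to do.

```java
// Standalone sketch. TruthValue here is a hypothetical, reduced stand-in,
// not the class used by the ORC reader.
final class RowGroupPruningSketch {

    /** Possible outcomes of evaluating a predicate against row-group statistics. */
    enum TruthValue {
        YES, NO, NULL, YES_NO_NULL;

        /** A row group has to be read only if some row could evaluate to YES. */
        boolean isNeeded() {
            return this == YES || this == YES_NO_NULL;
        }
    }

    /**
     * Evaluates an EQUALS predicate using only the value count and the null flag.
     * Assumption: with zero recorded values nothing can match, so the answer is
     * NO (or NULL when the group holds only nulls). Otherwise the outcome is
     * unknown without min/max information.
     */
    static TruthValue evaluateEquals(long numberOfValues, boolean hasNull) {
        if (numberOfValues == 0) {
            return hasNull ? TruthValue.NULL : TruthValue.NO;
        }
        return TruthValue.YES_NO_NULL;
    }

    public static void main(String[] args) {
        // Statistics from the trace above: numberOfValues: 0, hasNull: false.
        TruthValue result = evaluateEquals(0L, false);
        System.out.println(result + " -> read row group? " + result.isNeeded());
    }
}
```

With the statistics from the trace (numberOfValues: 0, hasNull: false), this sketch would classify the row group as not needed instead of returning YES_NO_NULL.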
--
This message was sent by Atlassian Jira
(v8.20.10#820010)