[jira] [Comment Edited] (ORC-1553) Reading information from Row group, where there are 0 records of SArg column

Alexander Petrossian (PAF) (Jira) Wed, 20 Dec 2023 23:05:05 -0800


    [ 
https://issues.apache.org/jira/browse/ORC-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17799283#comment-17799283
 ]


Alexander Petrossian (PAF) edited comment on ORC-1553 at 12/21/23 7:04 AM:
---------------------------------------------------------------------------

I am :(

{noformat}
entry = {OrcProto$RowIndexEntry@2550} "positions: 0\npositions: 0\npositions: 228\nstatistics {\n  numberOfValues: 0\n  stringStatistics {\n  }\n  hasNull: false\n}\n"
 bitField0_ = 1
 positions_ = {LongArrayList@2580}  size = 3
 positionsMemoizedSerializedSize = -1
 statistics_ = {OrcProto$ColumnStatistics@2581} "numberOfValues: 0\nstringStatistics {\n}\nhasNull: false\n"
  bitField0_ = 521
  numberOfValues_ = 0
  intStatistics_ = null
  doubleStatistics_ = null
  stringStatistics_ = {OrcProto$StringStatistics@2585} ""
   bitField0_ = 0
   minimum_ = ""
   maximum_ = ""
   sum_ = 0
   lowerBound_ = ""
   upperBound_ = ""
   memoizedIsInitialized = -1
   unknownFields = {UnknownFieldSet@2582} ""
   memoizedSize = -1
   memoizedHashCode = 0
  bucketStatistics_ = null
  decimalStatistics_ = null
  dateStatistics_ = null
  binaryStatistics_ = null
  timestampStatistics_ = null
  hasNull_ = false
  bytesOnDisk_ = 0
  collectionStatistics_ = null
  memoizedIsInitialized = -1
  unknownFields = {UnknownFieldSet@2582} ""
  memoizedSize = -1
  memoizedHashCode = 0
 memoizedIsInitialized = -1
 unknownFields = {UnknownFieldSet@2582} ""
 memoizedSize = -1
 memoizedHashCode = 0
{noformat}

Thanks a lot for looking into this!

I'm open to any experiments.


was (Author: neopaf):
I am :(

 !Снимок экрана 2023-12-21 в 10.00.23.png! 



Thanks a lot for looking into this.

I'm open to any experiments.

> Reading information from Row group, where there are 0 records of SArg column
> ----------------------------------------------------------------------------
>
>                 Key: ORC-1553
>                 URL: https://issues.apache.org/jira/browse/ORC-1553
>             Project: ORC
>          Issue Type: Bug
>    Affects Versions: 1.9.2
>            Reporter: Alexander Petrossian (PAF)
>            Priority: Major
>         Attachments: Снимок экрана 2023-12-21 в 10.00.23.png
>
>
> We created an .orc file using the Apache ORC library; I cannot provide a 
> reproducible way to create such a file.
> We have statistics for 100% of row groups, checked with orc dump.
> But when we search that file, we see very strange behavior:
> {code}
> TRACE org.apache.orc.impl.RecordReaderImpl: Stats = numberOfValues: 0
> stringStatistics {
> }
> hasNull: false
> TRACE org.apache.orc.impl.RecordReaderImpl: Setting (EQUALS value 71231231212) to YES_NO_NULL
> DEBUG org.apache.orc.impl.RecordReaderImpl: Row group 340000 to 349999 is included.
> {code}
> If the existing statistics say there are 0 values, there is obviously no need 
> to read that row group.
> And yet we get a YES_NO_NULL decision, which forces inclusion of that row group 
> in the subsequent read; this is pointless and bad for performance.
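
The expectation in the quoted description (statistics reporting 0 values means the row group can be skipped for an EQUALS predicate) can be sketched as below. This is only an illustration under stated assumptions, not ORC's actual RecordReaderImpl logic: RowGroupPruningSketch, Stats, Decision, evaluateEquals and mustRead are hypothetical names, and the Decision enum only loosely mirrors the YES_NO_NULL value seen in the log.

```java
// Hypothetical sketch; none of these types are ORC APIs.
final class RowGroupPruningSketch {

    /** Possible outcomes of evaluating a predicate against row-group statistics. */
    enum Decision { NO, NULL, NO_NULL, YES_NO, YES_NO_NULL }

    /** Minimal stand-in for the per-row-group column statistics shown in the dump. */
    static final class Stats {
        final long numberOfValues; // count of non-null values
        final boolean hasNull;
        final String minimum;      // null when the statistics carry no minimum
        final String maximum;      // null when the statistics carry no maximum

        Stats(long numberOfValues, boolean hasNull, String minimum, String maximum) {
            this.numberOfValues = numberOfValues;
            this.hasNull = hasNull;
            this.minimum = minimum;
            this.maximum = maximum;
        }
    }

    /** Evaluates "column EQUALS literal" against the statistics of one row group. */
    static Decision evaluateEquals(Stats stats, String literal) {
        if (stats.numberOfValues == 0) {
            // No non-null values at all: EQUALS can never be true here.
            return stats.hasNull ? Decision.NULL : Decision.NO;
        }
        if (stats.minimum == null || stats.maximum == null) {
            // Values exist but their range is unknown, so nothing can be pruned.
            return stats.hasNull ? Decision.YES_NO_NULL : Decision.YES_NO;
        }
        boolean inRange = literal.compareTo(stats.minimum) >= 0
                && literal.compareTo(stats.maximum) <= 0;
        if (!inRange) {
            return stats.hasNull ? Decision.NO_NULL : Decision.NO;
        }
        return stats.hasNull ? Decision.YES_NO_NULL : Decision.YES_NO;
    }

    /** A row group has to be read only if the predicate might be true for some row. */
    static boolean mustRead(Decision decision) {
        return decision == Decision.YES_NO || decision == Decision.YES_NO_NULL;
    }
}
```

With the statistics from the trace (numberOfValues 0, hasNull false, no minimum or maximum), evaluateEquals in this sketch returns NO and mustRead returns false, which is the behavior the report asks for.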



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
