[ https://issues.apache.org/jira/browse/ORC-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17469604#comment-17469604 ]
Bobby Wang commented on ORC-1075:
---------------------------------

Even if `index.getNumberOfValues()` returns the actual number of values, the RowIndex carries no ColumnStatistics, so `index` does not match any of the Integer/Double/... ColumnStatistics branches and finally falls into
{code:java}
new ValueRange(predicate, null, null, true);
{code}
which gives the same result. (A quick way to verify that the row-index entries really carry no statistics is sketched after the quoted description below.)

{code:java}
// 1.8.0-SNAPSHOT
static ValueRange getValueRange(ColumnStatistics index,
                                PredicateLeaf predicate,
                                boolean useUTCTimestamp) {
  if (index.getNumberOfValues() == 0) {
    return new ValueRange<>(predicate, index.hasNull());
  } else if (index instanceof IntegerColumnStatistics) {
    IntegerColumnStatistics stats = (IntegerColumnStatistics) index;
    Long min = stats.getMinimum();
    Long max = stats.getMaximum();
    return new ValueRange<>(predicate, min, max, stats.hasNull());
  } else if (index instanceof CollectionColumnStatistics) {
    CollectionColumnStatistics stats = (CollectionColumnStatistics) index;
    Long min = stats.getMinimumChildren();
    Long max = stats.getMaximumChildren();
    return new ValueRange<>(predicate, min, max, stats.hasNull());
  } else if (index instanceof DoubleColumnStatistics) {
    DoubleColumnStatistics stats = (DoubleColumnStatistics) index;
    Double min = stats.getMinimum();
    Double max = stats.getMaximum();
    return new ValueRange<>(predicate, min, max, stats.hasNull());
  } else if (index instanceof StringColumnStatistics) {
    StringColumnStatistics stats = (StringColumnStatistics) index;
    return new ValueRange<>(predicate, stats.getLowerBound(), stats.getUpperBound(),
        stats.hasNull(), stats.getMinimum() == null, stats.getMaximum() == null);
  } else if (index instanceof DateColumnStatistics) {
    DateColumnStatistics stats = (DateColumnStatistics) index;
    ChronoLocalDate min = stats.getMinimumLocalDate();
    ChronoLocalDate max = stats.getMaximumLocalDate();
    return new ValueRange<>(predicate, min, max, stats.hasNull());
  } else if (index instanceof DecimalColumnStatistics) {
    DecimalColumnStatistics stats = (DecimalColumnStatistics) index;
    HiveDecimal min = stats.getMinimum();
    HiveDecimal max = stats.getMaximum();
    return new ValueRange<>(predicate, min, max, stats.hasNull());
  } else if (index instanceof TimestampColumnStatistics) {
    TimestampColumnStatistics stats = (TimestampColumnStatistics) index;
    Timestamp min = useUTCTimestamp ? stats.getMinimumUTC() : stats.getMinimum();
    Timestamp max = useUTCTimestamp ? stats.getMaximumUTC() : stats.getMaximum();
    return new ValueRange<>(predicate, min, max, stats.hasNull());
  } else if (index instanceof BooleanColumnStatistics) {
    BooleanColumnStatistics stats = (BooleanColumnStatistics) index;
    Boolean min = stats.getFalseCount() == 0;
    Boolean max = stats.getTrueCount() != 0;
    return new ValueRange<>(predicate, min, max, stats.hasNull());
  } else {
    return new ValueRange(predicate, null, null, true);
  }
}
{code}

> Failed to read rows from the ORC file without statistics in RowIndex when filter is pushed down for 1.6.11
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: ORC-1075
>                 URL: https://issues.apache.org/jira/browse/ORC-1075
>             Project: ORC
>          Issue Type: Bug
>          Components: Java, Reader
>    Affects Versions: 1.6.11
>            Reporter: Bobby Wang
>            Priority: Blocker
>         Attachments: none-1.orc
>
>
> I have attached an ORC file that seems not to include ColumnStatistics in RowIndex.
> {color:#FF0000}From the ORC spec, it seems RowIndex.ColumnStatistics is not a required field?{color}
>
> {code:java}
> message RowIndexEntry {
>   repeated uint64 positions = 1 [packed=true];
>   optional ColumnStatistics statistics = 2;
> }
>
> message RowIndex {
>   repeated RowIndexEntry entry = 1;
> }
> {code}
>
> The metadata of the ORC file:
>
> {code:java}
> $ orctools meta none.orc
> log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
> Processing data file none.orc [length: 124]
> Structure for none.orc
> File Version: 0.12 with ORIGINAL
> Rows: 3
> Compression: NONE
> Calendar: Julian/Gregorian
> Type: struct<INT:int>
>
> Stripe Statistics:
>   Stripe 1:
>     Column 0: count: 3 hasNull: true
>     Column 1: count: 3 hasNull: true min: 1 max: 3 sum: 6
>
> File Statistics:
>
> Stripes:
>   Stripe: offset: 3 data: 4 rows: 3 tail: 32 index: 10
>     Stream: column 0 section ROW_INDEX start: 3 length 4
>     Stream: column 1 section ROW_INDEX start: 7 length 6
>     Stream: column 1 section DATA start: 13 length 4
>     Encoding column 0: DIRECT
>     Encoding column 1: DIRECT_V2
>
> File length: 124 bytes
> Padding length: 0 bytes
> Padding ratio: 0%
> {code}
>
> The data of the ORC file:
>
> {code:java}
> $ orctools data none.orc
> log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
> Processing data file none.orc [length: 124]
> {"INT":1}
> {"INT":2}
> {"INT":3}
> {code}
>
> The following code tries to read each row of the ORC file:
>
> {code:java}
> // Pick the schema we want to read using schema evolution
> TypeDescription readSchema = TypeDescription.fromString("struct<INT:int>");
>
> // Get the information from the file footer
> Reader reader = OrcFile.createReader(new Path("none.orc"),
>     OrcFile.readerOptions(new Configuration()));
> System.out.println("File schema: " + reader.getSchema());
> System.out.println("Row count: " + reader.getNumberOfRows());
>
> RecordReader rowIterator = reader.rows(
>     reader.options()
>         .schema(readSchema)
>         .searchArgument(SearchArgumentFactory.newBuilder()
>             .equals("INT", PredicateLeaf.Type.LONG, 2L)
>             .build(), new String[]{"INT"})); // predicate push down
>
> // Read the row data
> VectorizedRowBatch batch = readSchema.createRowBatch();
> LongColumnVector x = (LongColumnVector) batch.cols[0];
> while (rowIterator.nextBatch(batch)) {
>   System.out.println(batch.size);
>   for (int row = 0; row < batch.size; ++row) {
>     int xRow = x.isRepeating ? 0 : row;
>     System.out.println("INT: " + (x.noNulls || !x.isNull[xRow] ? x.vector[xRow] : null));
>   }
> }
> rowIterator.close();
> {code}
>
> h2. Output from 1.6.11
> File schema: struct<INT:int>
> Row count: 3
>
> h2. Output from 1.5.10
> File schema: struct<INT:int>
> Row count: 3
> 3
> INT: 1
> INT: 2
> INT: 3
>
> I originally found this issue on Spark 3.2, which depends on ORC 1.6.11; there is no such issue on Spark 3.0.x, which depends on ORC 1.5.10.
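To double-check the claim above (that the row index of the attached file carries no per-entry statistics), here is a minimal sketch. It assumes the `RecordReaderImpl#readRowIndex(int, boolean[], boolean[])` signature from orc-core and the generated `OrcProto` accessors, and the cast from `RecordReader` to `RecordReaderImpl` mirrors what orc-tools' FileDump does; treat it as an unverified sketch, not a tested reproduction.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.OrcProto;
import org.apache.orc.Reader;
import org.apache.orc.impl.OrcIndex;
import org.apache.orc.impl.RecordReaderImpl;

public class CheckRowIndexStats {
  public static void main(String[] args) throws Exception {
    Reader reader = OrcFile.createReader(new Path("none.orc"),
        OrcFile.readerOptions(new Configuration()));
    // RecordReaderImpl exposes readRowIndex(); orc-tools uses the same cast.
    RecordReaderImpl rows = (RecordReaderImpl) reader.rows();
    // Read the row index of stripe 0 for all columns (null = no column/sarg filter).
    OrcIndex index = rows.readRowIndex(0, null, null);
    OrcProto.RowIndex[] groups = index.getRowGroupIndex();
    for (int col = 0; col < groups.length; ++col) {
      for (OrcProto.RowIndexEntry entry : groups[col].getEntryList()) {
        // statistics is an optional protobuf field, so hasStatistics() tells us
        // whether the writer recorded ColumnStatistics for this row group.
        System.out.println("column " + col + " hasStatistics: " + entry.hasStatistics());
      }
    }
    rows.close();
  }
}
{code}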
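For completeness, here is the same read with the search argument left out. If the analysis above is right (the row-group elimination is only triggered by evaluating the pushed-down predicate against the missing statistics), this should print all three rows even on 1.6.11. Same classes and imports as the reporter's snippet; untested against the attached file.
{code:java}
// Same setup as the reproduction above, but without .searchArgument(...),
// so no predicate is pushed down and no row groups are filtered out.
TypeDescription readSchema = TypeDescription.fromString("struct<INT:int>");
Reader reader = OrcFile.createReader(new Path("none.orc"),
    OrcFile.readerOptions(new Configuration()));
RecordReader rowIterator = reader.rows(reader.options().schema(readSchema));

VectorizedRowBatch batch = readSchema.createRowBatch();
LongColumnVector x = (LongColumnVector) batch.cols[0];
while (rowIterator.nextBatch(batch)) {
  for (int row = 0; row < batch.size; ++row) {
    int xRow = x.isRepeating ? 0 : row;
    System.out.println("INT: " + (x.noNulls || !x.isNull[xRow] ? x.vector[xRow] : null));
  }
}
rowIterator.close();
{code}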