[ https://issues.apache.org/jira/browse/ORC-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17469604#comment-17469604 ]
Bobby Wang commented on ORC-1075:
---------------------------------

Even if `index.getNumberOfValues()` returns the actual number of values, the RowIndex carries no ColumnStatistics, so `index` does not match any of the Integer/Double/... ColumnStatistics branches and finally falls into
{code:java}
new ValueRange(predicate, null, null, true);
{code}
which gives the same result. (A quick way to verify that the row-index entries really carry no statistics is sketched after the quoted description below.)

{code:java}
// 1.8.0-SNAPSHOT
static ValueRange getValueRange(ColumnStatistics index,
                                PredicateLeaf predicate,
                                boolean useUTCTimestamp) {
  if (index.getNumberOfValues() == 0) {
    return new ValueRange<>(predicate, index.hasNull());
  } else if (index instanceof IntegerColumnStatistics) {
    IntegerColumnStatistics stats = (IntegerColumnStatistics) index;
    Long min = stats.getMinimum();
    Long max = stats.getMaximum();
    return new ValueRange<>(predicate, min, max, stats.hasNull());
  } else if (index instanceof CollectionColumnStatistics) {
    CollectionColumnStatistics stats = (CollectionColumnStatistics) index;
    Long min = stats.getMinimumChildren();
    Long max = stats.getMaximumChildren();
    return new ValueRange<>(predicate, min, max, stats.hasNull());
  } else if (index instanceof DoubleColumnStatistics) {
    DoubleColumnStatistics stats = (DoubleColumnStatistics) index;
    Double min = stats.getMinimum();
    Double max = stats.getMaximum();
    return new ValueRange<>(predicate, min, max, stats.hasNull());
  } else if (index instanceof StringColumnStatistics) {
    StringColumnStatistics stats = (StringColumnStatistics) index;
    return new ValueRange<>(predicate, stats.getLowerBound(), stats.getUpperBound(),
        stats.hasNull(), stats.getMinimum() == null, stats.getMaximum() == null);
  } else if (index instanceof DateColumnStatistics) {
    DateColumnStatistics stats = (DateColumnStatistics) index;
    ChronoLocalDate min = stats.getMinimumLocalDate();
    ChronoLocalDate max = stats.getMaximumLocalDate();
    return new ValueRange<>(predicate, min, max, stats.hasNull());
  } else if (index instanceof DecimalColumnStatistics) {
    DecimalColumnStatistics stats = (DecimalColumnStatistics) index;
    HiveDecimal min = stats.getMinimum();
    HiveDecimal max = stats.getMaximum();
    return new ValueRange<>(predicate, min, max, stats.hasNull());
  } else if (index instanceof TimestampColumnStatistics) {
    TimestampColumnStatistics stats = (TimestampColumnStatistics) index;
    Timestamp min = useUTCTimestamp ? stats.getMinimumUTC() : stats.getMinimum();
    Timestamp max = useUTCTimestamp ? stats.getMaximumUTC() : stats.getMaximum();
    return new ValueRange<>(predicate, min, max, stats.hasNull());
  } else if (index instanceof BooleanColumnStatistics) {
    BooleanColumnStatistics stats = (BooleanColumnStatistics) index;
    Boolean min = stats.getFalseCount() == 0;
    Boolean max = stats.getTrueCount() != 0;
    return new ValueRange<>(predicate, min, max, stats.hasNull());
  } else {
    return new ValueRange(predicate, null, null, true);
  }
}
{code}

> Failed to read rows from the ORC file without statistics in RowIndex when filter is pushed down for 1.6.11
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: ORC-1075
>                 URL: https://issues.apache.org/jira/browse/ORC-1075
>             Project: ORC
>          Issue Type: Bug
>          Components: Java, Reader
>    Affects Versions: 1.6.11
>            Reporter: Bobby Wang
>            Priority: Blocker
>         Attachments: none-1.orc
>
>
> I have attached an ORC file that seems not to include ColumnStatistics in RowIndex.
> {color:#FF0000}From the ORC spec, it seems RowIndex.ColumnStatistics is not a required field?{color}
>
> {code:java}
> message RowIndexEntry {
>   repeated uint64 positions = 1 [packed=true];
>   optional ColumnStatistics statistics = 2;
> }
>
> message RowIndex {
>   repeated RowIndexEntry entry = 1;
> }
> {code}
>
> The metadata of the ORC file:
>
> {code:java}
> $ orctools meta none.orc
> log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
> Processing data file none.orc [length: 124]
> Structure for none.orc
> File Version: 0.12 with ORIGINAL
> Rows: 3
> Compression: NONE
> Calendar: Julian/Gregorian
> Type: struct<INT:int>
>
> Stripe Statistics:
>   Stripe 1:
>     Column 0: count: 3 hasNull: true
>     Column 1: count: 3 hasNull: true min: 1 max: 3 sum: 6
>
> File Statistics:
>
> Stripes:
>   Stripe: offset: 3 data: 4 rows: 3 tail: 32 index: 10
>     Stream: column 0 section ROW_INDEX start: 3 length 4
>     Stream: column 1 section ROW_INDEX start: 7 length 6
>     Stream: column 1 section DATA start: 13 length 4
>     Encoding column 0: DIRECT
>     Encoding column 1: DIRECT_V2
>
> File length: 124 bytes
> Padding length: 0 bytes
> Padding ratio: 0%
> {code}
>
> The data of the ORC file:
>
> {code:java}
> $ orctools data none.orc
> log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
> Processing data file none.orc [length: 124]
> {"INT":1}
> {"INT":2}
> {"INT":3}
> {code}
>
> The following code tries to read each row of the ORC file:
>
> {code:java}
> // Pick the schema we want to read using schema evolution
> TypeDescription readSchema = TypeDescription.fromString("struct<INT:int>");
>
> // Get the information from the file footer
> Reader reader = OrcFile.createReader(new Path("none.orc"),
>     OrcFile.readerOptions(new Configuration()));
> System.out.println("File schema: " + reader.getSchema());
> System.out.println("Row count: " + reader.getNumberOfRows());
>
> RecordReader rowIterator = reader.rows(
>     reader.options()
>         .schema(readSchema)
>         .searchArgument(SearchArgumentFactory.newBuilder()
>             .equals("INT", PredicateLeaf.Type.LONG, 2L)
>             .build(), new String[]{"INT"})); // predicate push down
>
> // Read the row data
> VectorizedRowBatch batch = readSchema.createRowBatch();
> LongColumnVector x = (LongColumnVector) batch.cols[0];
> while (rowIterator.nextBatch(batch)) {
>   System.out.println(batch.size);
>   for (int row = 0; row < batch.size; ++row) {
>     int xRow = x.isRepeating ? 0 : row;
>     System.out.println("INT: " + (x.noNulls || !x.isNull[xRow] ? x.vector[xRow] : null));
>   }
> }
> rowIterator.close();
> {code}
>
> h2. Output from 1.6.11
> File schema: struct<INT:int>
> Row count: 3
>
> h2. Output from 1.5.10
> File schema: struct<INT:int>
> Row count: 3
> 3
> INT: 1
> INT: 2
> INT: 3
>
> I originally found this issue on Spark 3.2, which depends on ORC 1.6.11; there is no such issue on Spark 3.0.x, which depends on ORC 1.5.10.
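To double-check the claim above (that the row index of the attached file carries no per-entry statistics), here is a minimal sketch. It assumes the `RecordReaderImpl#readRowIndex(int, boolean[], boolean[])` signature from orc-core and the generated `OrcProto` accessors, and the cast from `RecordReader` to `RecordReaderImpl` mirrors what orc-tools' FileDump does; treat it as an unverified sketch, not a tested reproduction.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.OrcProto;
import org.apache.orc.Reader;
import org.apache.orc.impl.OrcIndex;
import org.apache.orc.impl.RecordReaderImpl;

public class CheckRowIndexStats {
  public static void main(String[] args) throws Exception {
    Reader reader = OrcFile.createReader(new Path("none.orc"),
        OrcFile.readerOptions(new Configuration()));
    // RecordReaderImpl exposes readRowIndex(); orc-tools uses the same cast.
    RecordReaderImpl rows = (RecordReaderImpl) reader.rows();
    // Read the row index of stripe 0 for all columns (null = no column/sarg filter).
    OrcIndex index = rows.readRowIndex(0, null, null);
    OrcProto.RowIndex[] groups = index.getRowGroupIndex();
    for (int col = 0; col < groups.length; ++col) {
      for (OrcProto.RowIndexEntry entry : groups[col].getEntryList()) {
        // statistics is an optional protobuf field, so hasStatistics() tells us
        // whether the writer recorded ColumnStatistics for this row group.
        System.out.println("column " + col + " hasStatistics: " + entry.hasStatistics());
      }
    }
    rows.close();
  }
}
{code}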
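For completeness, here is the same read with the search argument left out. If the analysis above is right (the row-group elimination is only triggered by evaluating the pushed-down predicate against the missing statistics), this should print all three rows even on 1.6.11. Same classes and imports as the reporter's snippet; untested against the attached file.
{code:java}
// Same setup as the reproduction above, but without .searchArgument(...),
// so no predicate is pushed down and no row groups are filtered out.
TypeDescription readSchema = TypeDescription.fromString("struct<INT:int>");
Reader reader = OrcFile.createReader(new Path("none.orc"),
    OrcFile.readerOptions(new Configuration()));
RecordReader rowIterator = reader.rows(reader.options().schema(readSchema));

VectorizedRowBatch batch = readSchema.createRowBatch();
LongColumnVector x = (LongColumnVector) batch.cols[0];
while (rowIterator.nextBatch(batch)) {
  for (int row = 0; row < batch.size; ++row) {
    int xRow = x.isRepeating ? 0 : row;
    System.out.println("INT: " + (x.noNulls || !x.isNull[xRow] ? x.vector[xRow] : null));
  }
}
rowIterator.close();
{code}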