sunchao commented on a change in pull request #34659:
URL: https://github.com/apache/spark/pull/34659#discussion_r838976941
##########
File path:
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
##########
@@ -214,24 +231,32 @@ void readBatch(int total, WritableColumnVector column)
throws IOException {
boolean needTransform = castLongToInt || isUnsignedInt32 || isUnsignedInt64;
column.setDictionary(new ParquetDictionary(dictionary, needTransform));
} else {
- updater.decodeDictionaryIds(readState.offset - startOffset, startOffset, column,
+ updater.decodeDictionaryIds(readState.valueOffset - startOffset, startOffset, column,
dictionaryIds, dictionary);
}
} else {
- if (column.hasDictionary() && readState.offset != 0) {
+ if (column.hasDictionary() && readState.valueOffset != 0) {
// This batch already has dictionary encoded values but this new page is not. The batch
// does not support a mix of dictionary and not so we will decode the dictionary.
- updater.decodeDictionaryIds(readState.offset, 0, column, dictionaryIds, dictionary);
+ updater.decodeDictionaryIds(readState.valueOffset, 0, column, dictionaryIds, dictionary);
}
column.setDictionary(null);
VectorizedValuesReader valuesReader = (VectorizedValuesReader) dataColumn;
- defColumn.readBatch(readState, column, valuesReader, updater);
+ if (readState.maxRepetitionLevel == 0) {
+ defColumn.readBatch(readState, column, definitionLevels, valuesReader, updater);
+ } else {
+ repColumn.readBatchNested(readState, repetitionLevels, defColumn, definitionLevels,
+ column, valuesReader, updater);
+ }
}
}
}
private int readPage() {
DataPage page = pageReader.readPage();
+ if (page == null) {
+ return -1;
+ }
Review comment:
Yes, I added a comment about this at line 182 above:

> we've read all the pages; this could happen when we're reading a repeated
list and we don't know where the list will end until we've seen all the pages.

For primitive types, we know (a) the exact total number of values to read,
and (b) the number of values in each page. Therefore we know which page
is the last one.

However, a repeated list can span multiple pages in Parquet, so we don't
know which page is the last one until the page reader returns null.
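
To make the difference concrete, here is a minimal, self-contained sketch of the two stopping conditions. It is not the code in this PR: `PageLoopSketch`, `PageSource`, `readFlat` and `readRepeated` are hypothetical names, and each page is reduced to its value count. Only the convention that `readPage()` returns -1 once the underlying reader yields null mirrors the change above.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

final class PageLoopSketch {
  /** Hypothetical page source; each queued Integer is one page's value count. */
  static final class PageSource {
    private final Queue<Integer> pages;

    PageSource(List<Integer> valueCounts) {
      this.pages = new ArrayDeque<>(valueCounts);
    }

    /** Returns the next page's value count, or -1 when no pages are left. */
    int readPage() {
      Integer valueCount = pages.poll(); // poll() returns null when empty
      return valueCount == null ? -1 : valueCount;
    }
  }

  /** Flat column: the total is known up front, so we stop by counting values. */
  static int readFlat(PageSource source, int totalValues) {
    int read = 0;
    while (read < totalValues) {
      int n = source.readPage();
      if (n < 0) {
        throw new IllegalStateException("ran out of pages before reading all values");
      }
      read += n;
    }
    return read;
  }

  /** Repeated column: the end is unknown, so we read until readPage() reports -1. */
  static int readRepeated(PageSource source) {
    int read = 0;
    int n;
    while ((n = source.readPage()) >= 0) {
      read += n;
    }
    return read;
  }

  public static void main(String[] args) {
    int flat = readFlat(new PageSource(List.of(3, 2, 1)), 6);
    int repeated = readRepeated(new PageSource(List.of(3, 2, 1)));
    System.out.println("flat=" + flat + " repeated=" + repeated);
  }
}
```

In the flat case, running out of pages early is an error; in the repeated case, it is the normal termination signal, which is why the null check is needed.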
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]