HyukjinKwon commented on a change in pull request #33006:
URL: https://github.com/apache/spark/pull/33006#discussion_r655847305
##########
File path:
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
##########
@@ -174,24 +162,29 @@ void readBatch(int total, WritableColumnVector column) throws IOException {
// page.
dictionaryIds = column.reserveDictionaryIds(total);
}
- while (total > 0) {
+ readState.resetForBatch(total);
+ while (readState.valuesToReadInBatch > 0) {
// Compute the number of values we want to read in this page.
- int leftInPage = (int) (endOfPageValueCount - valuesRead);
- if (leftInPage == 0) {
+ if (readState.valuesToReadInPage == 0) {
readPage();
- leftInPage = (int) (endOfPageValueCount - valuesRead);
+ readState.resetForPage(pageValueCount);
}
- int num = Math.min(total, leftInPage);
PrimitiveType.PrimitiveTypeName typeName =
descriptor.getPrimitiveType().getPrimitiveTypeName();
if (isCurrentPageDictionaryEncoded) {
+ boolean supportLazyDecoding = readState.offset == 0 &&
+ isLazyDecodingSupported(typeName);
+
+ // Save starting offset in case we need to decode dictionary IDs.
+ int startOffset = readState.offset;
+
// Read and decode dictionary ids.
- defColumn.readIntegers(
- num, dictionaryIds, column, rowId, maxDefLevel, (VectorizedValuesReader) dataColumn);
+ defColumn.readIntegers(readState, dictionaryIds, column,
+ (VectorizedValuesReader) dataColumn);
// TIMESTAMP_MILLIS encoded as INT64 can't be lazily decoded as we need to post process
// the values to add microseconds precision.
- if (column.hasDictionary() || (rowId == 0 && isLazyDecodingSupported(typeName))) {
+ if (column.hasDictionary() || supportLazyDecoding) {
Review comment:
Just to make sure we don't miss anything, there's a small side effect
here. Previously `(rowId == 0 && isLazyDecodingSupported(typeName))` wasn't
evaluated when `column.hasDictionary()` was true, because `||` short-circuits,
but now the equivalent check is always evaluated up front.
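
To illustrate the difference in evaluation order, here is a minimal standalone sketch. It is not the Spark code: `ShortCircuitSketch`, `oldCondition`, `newCondition`, and the call counter are hypothetical names made up for the illustration, and the stand-in `isLazyDecodingSupported()` simply returns `true` and counts how often it runs.

```java
public class ShortCircuitSketch {
    // Hypothetical counter: how many times the stand-in helper was invoked.
    public static int calls = 0;

    // Stand-in for isLazyDecodingSupported(typeName); always true here.
    static boolean isLazyDecodingSupported() {
        calls++;
        return true;
    }

    // Shape of the old condition: the right-hand side of || is skipped
    // entirely when hasDictionary is true.
    static boolean oldCondition(boolean hasDictionary, int rowId) {
        return hasDictionary || (rowId == 0 && isLazyDecodingSupported());
    }

    // Shape of the new condition: the flag is computed before the ||,
    // so the helper runs whenever offset == 0, regardless of hasDictionary.
    static boolean newCondition(boolean hasDictionary, int offset) {
        boolean supportLazyDecoding = offset == 0 && isLazyDecodingSupported();
        return hasDictionary || supportLazyDecoding;
    }

    public static void main(String[] args) {
        calls = 0;
        oldCondition(true, 0);
        System.out.println("old form, helper calls: " + calls);

        calls = 0;
        newCondition(true, 0);
        System.out.println("new form, helper calls: " + calls);
    }
}
```

Both forms return the same boolean for the same inputs in this sketch; only the number of helper invocations differs. Whether that matters in the PR depends on whether `isLazyDecodingSupported` has side effects or a noticeable cost, which the sketch does not try to model.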