[GitHub] [spark] cloud-fan commented on a change in pull request #33006: [SPARK-35846][SQL] Introduce ParquetReadState to track various states while reading a Parquet column chunk

GitBox Tue, 22 Jun 2021 22:48:17 -0700


cloud-fan commented on a change in pull request #33006:
URL: https://github.com/apache/spark/pull/33006#discussion_r656777628




##########
File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
##########
@@ -216,53 +195,49 @@ void readBatch(int total, WritableColumnVector column) throws IOException {
           boolean needTransform = castLongToInt || isUnsignedInt32 || isUnsignedInt64;
           column.setDictionary(new ParquetDictionary(dictionary, needTransform));
         } else {
-          updater.decodeDictionaryIds(num, rowId, column, dictionaryIds, dictionary);
+          updater.decodeDictionaryIds(readState.offset - startOffset, startOffset, column,
+            dictionaryIds, dictionary);
         }
       } else {
-        if (column.hasDictionary() && rowId != 0) {
+        if (column.hasDictionary() && readState.offset != 0) {
           // This batch already has dictionary encoded values but this new page is not. The batch
           // does not support a mix of dictionary and not so we will decode the dictionary.
-          updater.decodeDictionaryIds(rowId, 0, column, dictionaryIds, dictionary);
+          updater.decodeDictionaryIds(readState.offset, 0, column, dictionaryIds, dictionary);
         }
         column.setDictionary(null);
         VectorizedValuesReader valuesReader = (VectorizedValuesReader) dataColumn;
-        defColumn.readBatch(num, rowId, column, maxDefLevel, valuesReader, updater);
+        defColumn.readBatch(readState, column, valuesReader, updater);
       }
-
-      valuesRead += num;
-      rowId += num;
-      total -= num;
     }
   }
 
-  private void readPage() {
+  private int readPage() {
     DataPage page = pageReader.readPage();
-    // TODO: Why is this a visitor?
-    page.accept(new DataPage.Visitor<Void>() {
+    return page.accept(new DataPage.Visitor<Integer>() {
       @Override
-      public Void visit(DataPageV1 dataPageV1) {
+      public Integer visit(DataPageV1 dataPageV1) {

Review comment:
       Ah, I see. Let's leave it then.
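
Editorial note: the hunk above changes `readPage()` from `void` to `int` by switching the visitor's type parameter from `Void` to `Integer`, so the value produced inside `visit` is returned to the caller instead of being discarded. The sketch below illustrates that pattern in isolation. All names in it (`Page`, `PageV1`, `PageV2`, `PageVisitor`, `PageReaderSketch`) are hypothetical stand-ins and not the Parquet `DataPage` API, and returning a value count is only an assumption about what the real method returns.

```java
// Hypothetical stand-ins for a two-variant page type; this is not the Parquet API.
interface PageVisitor<T> {
  T visit(PageV1 page);
  T visit(PageV2 page);
}

abstract class Page {
  final int valueCount;

  Page(int valueCount) {
    this.valueCount = valueCount;
  }

  // Each concrete page dispatches to the matching visit overload.
  abstract <T> T accept(PageVisitor<T> visitor);
}

final class PageV1 extends Page {
  PageV1(int valueCount) {
    super(valueCount);
  }

  @Override
  <T> T accept(PageVisitor<T> visitor) {
    return visitor.visit(this);
  }
}

final class PageV2 extends Page {
  PageV2(int valueCount) {
    super(valueCount);
  }

  @Override
  <T> T accept(PageVisitor<T> visitor) {
    return visitor.visit(this);
  }
}

final class PageReaderSketch {
  // With PageVisitor<Integer>, accept() yields the visitor's result,
  // so the method can return it rather than mutating a field as a side effect.
  static int readPage(Page page) {
    return page.accept(new PageVisitor<Integer>() {
      @Override
      public Integer visit(PageV1 p) {
        return p.valueCount;
      }

      @Override
      public Integer visit(PageV2 p) {
        return p.valueCount;
      }
    });
  }
}
```

A caller could then write `int n = PageReaderSketch.readPage(new PageV1(3));` and use `n` directly, which is the kind of call-site simplification a value-returning visitor allows.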







