[GitHub] [spark] sunchao commented on a change in pull request #32753: [SPARK-34859][SQL] Handle column index when using vectorized Parquet reader

GitBox Fri, 18 Jun 2021 14:06:35 -0700


sunchao commented on a change in pull request #32753:
URL: https://github.com/apache/spark/pull/32753#discussion_r654010520




##########
File path: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java
##########
@@ -340,6 +398,71 @@ public Binary readBinary(int len) {
     throw new UnsupportedOperationException("only readInts is valid.");
   }
 
+  @Override
+  public void skipIntegers(int total) {
+    int left = total;
+    while (left > 0) {
+      if (this.currentCount == 0) this.readNextGroup();
+      int n = Math.min(left, this.currentCount);
+      advance(n);
+      left -= n;
+    }
+  }
+
+  @Override
+  public void skipBooleans(int total) {
+    throw new UnsupportedOperationException("only skipIntegers is supported");

Review comment:
       I can change it to `only skipIntegers is valid`. Parquet RLE encoding 
only supports integers, which is why all the other methods are 
unimplemented.
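
To illustrate the point above, here is a minimal Java sketch of skipping over run-length-encoded integers. `RunSkippingSketch` and its run arrays are hypothetical and are not Spark's `VectorizedRleValuesReader`. It models only plain RLE runs (no bit-packed groups) and inlines the `advance` step as a counter decrement, so treat it as an approximation of the idea rather than the PR's implementation.

```java
// Hypothetical, simplified stand-in for a run-based reader; not Spark's actual class.
final class RunSkippingSketch {
  // Run i holds runCounts[i] repetitions of the integer runValues[i].
  private final int[] runValues;
  private final int[] runCounts;
  private int nextRun = 0;      // index of the next run to load
  private int currentCount = 0; // values remaining in the current run
  private int currentValue = 0; // value repeated by the current run

  RunSkippingSketch(int[] runValues, int[] runCounts) {
    if (runValues.length != runCounts.length) {
      throw new IllegalArgumentException("runValues and runCounts must have equal length");
    }
    this.runValues = runValues;
    this.runCounts = runCounts;
  }

  // Loads the next run; assumed analogue of readNextGroup() in the diff.
  private void readNextGroup() {
    if (nextRun >= runValues.length) {
      throw new IllegalStateException("no more runs");
    }
    currentValue = runValues[nextRun];
    currentCount = runCounts[nextRun];
    nextRun++;
  }

  int readInteger() {
    while (currentCount == 0) readNextGroup();
    currentCount--;
    return currentValue;
  }

  // Skips `total` values one run at a time without decoding each value.
  void skipIntegers(int total) {
    int left = total;
    while (left > 0) {
      if (currentCount == 0) readNextGroup();
      int n = Math.min(left, currentCount);
      currentCount -= n; // the "advance" step for a plain RLE run
      left -= n;
    }
  }

  // The run stream carries integers only, so other types are rejected.
  void skipBooleans(int total) {
    throw new UnsupportedOperationException("only skipIntegers is valid");
  }
}
```

Because a skip consumes whole runs where it can, its cost in this sketch grows with the number of runs crossed, not with the number of values skipped.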

##########
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala
##########
@@ -378,11 +380,77 @@ class ParquetIOSuite extends QueryTest with ParquetTest with SharedSparkSession
       .withWriterVersion(PARQUET_1_0)
       .withCompressionCodec(GZIP)
       .withRowGroupSize(1024 * 1024)
-      .withPageSize(1024)
+      .withPageSize(pageSize)
+      .withDictionaryPageSize(dictionaryPageSize)
       .withConf(hadoopConf)
       .build()
   }
 
+  test("test multiple pages with different sizes and nulls") {

Review comment:
       Sure, will do.

##########
File path: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java
##########
@@ -61,6 +61,14 @@ public final void readBooleans(int total, WritableColumnVector c, int rowId) {
     }
   }
 
+  @Override
+  public final void skipBooleans(int total) {
+    // TODO: properly vectorize this

Review comment:
       This follows a few other TODOs in this file for handling booleans. Yes, 
I can file a JIRA too.





