[GitHub] [spark] dongjoon-hyun commented on a change in pull request #32753: [SPARK-34859][SQL] Handle column index when using vectorized Parquet reader

GitBox Fri, 18 Jun 2021 14:30:53 -0700


dongjoon-hyun commented on a change in pull request #32753:
URL: https://github.com/apache/spark/pull/32753#discussion_r653970103




##########
File path: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java
##########
@@ -340,6 +398,71 @@ public Binary readBinary(int len) {
     throw new UnsupportedOperationException("only readInts is valid.");
   }
 
+  @Override
+  public void skipIntegers(int total) {
+    int left = total;
+    while (left > 0) {
+      if (this.currentCount == 0) this.readNextGroup();
+      int n = Math.min(left, this.currentCount);
+      advance(n);
+      left -= n;
+    }
+  }
+
+  @Override
+  public void skipBooleans(int total) {
+    throw new UnsupportedOperationException("only skipIntegers is supported");

Review comment:
       Do we have a plan to support `skipBooleans`?

##########
File path: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java
##########
@@ -340,6 +398,71 @@ public Binary readBinary(int len) {
     throw new UnsupportedOperationException("only readInts is valid.");
   }
 
+  @Override
+  public void skipIntegers(int total) {
+    int left = total;
+    while (left > 0) {
+      if (this.currentCount == 0) this.readNextGroup();
+      int n = Math.min(left, this.currentCount);
+      advance(n);
+      left -= n;
+    }
+  }
+
+  @Override
+  public void skipBooleans(int total) {
+    throw new UnsupportedOperationException("only skipIntegers is supported");

Review comment:
       `only skipIntegers is supported` has a little different meaning from 
`only skipIntegers is valid`. I'm curious what was the correct intention here.

##########
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala
##########
@@ -378,11 +380,77 @@ class ParquetIOSuite extends QueryTest with ParquetTest 
with SharedSparkSession
       .withWriterVersion(PARQUET_1_0)
       .withCompressionCodec(GZIP)
       .withRowGroupSize(1024 * 1024)
-      .withPageSize(1024)
+      .withPageSize(pageSize)
+      .withDictionaryPageSize(dictionaryPageSize)
       .withConf(hadoopConf)
       .build()
   }
 
+  test("test multiple pages with different sizes and nulls") {

Review comment:
       For this test case, shall we add `SPARK-34859: ` prefix?

##########
File path: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java
##########
@@ -61,6 +61,14 @@ public final void readBooleans(int total, 
WritableColumnVector c, int rowId) {
     }
   }
 
+  @Override
+  public final void skipBooleans(int total) {
+    // TODO: properly vectorize this

Review comment:
       Shall we file a JIRA and make this IDed TODO?

##########
File path: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java
##########
@@ -340,6 +398,71 @@ public Binary readBinary(int len) {
     throw new UnsupportedOperationException("only readInts is valid.");
   }
 
+  @Override
+  public void skipIntegers(int total) {
+    int left = total;
+    while (left > 0) {
+      if (this.currentCount == 0) this.readNextGroup();
+      int n = Math.min(left, this.currentCount);
+      advance(n);
+      left -= n;
+    }
+  }
+
+  @Override
+  public void skipBooleans(int total) {
+    throw new UnsupportedOperationException("only skipIntegers is supported");

Review comment:
       +1 for changing.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #32753: [SPARK-34859][SQL] Handle column index when using vectorized Parquet reader

Reply via email to