[GitHub] [spark] c21 commented on a change in pull request #35090: [SPARK-37812][SQL] When deserializing an Orc struct, reuse the result row when possible

GitBox Tue, 04 Jan 2022 23:06:44 -0800


c21 commented on a change in pull request #35090:
URL: https://github.com/apache/spark/pull/35090#discussion_r778584969




##########
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala
##########
@@ -844,6 +844,160 @@ abstract class OrcSourceSuite extends OrcSuite with 
SharedSparkSession {
       }
     }
   }
+
+  // Test the various cases during deserialization where we may or may not copy
+  // a struct writer's result row
+  test("struct in an array") {
+    withTempPath { file =>
+      val df = sql(
+        """SELECT
+          |  array(
+          |    named_struct(
+          |      'a1', 1,
+          |      'a2', 2),
+          |    named_struct(
+          |      'a1', 3,
+          |      'a2', 4)
+          |  ) as col1
+          |""".stripMargin)
+      df.write.orc(file.getCanonicalPath)
+      val df2 = spark.read.orc(file.getCanonicalPath)
+      checkAnswer(df2, df.collect().toSeq)
+    }
+  }
+
+  test("struct in a map #1") {

Review comment:
       nit: I think these all test cases can be written more concisely in one 
test. The only different is the `df` sql query.

##########
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala
##########
@@ -844,6 +844,160 @@ abstract class OrcSourceSuite extends OrcSuite with 
SharedSparkSession {
       }
     }
   }
+
+  // Test the various cases during deserialization where we may or may not copy
+  // a struct writer's result row
+  test("struct in an array") {

Review comment:
       nit: instead of comment, shall we have a more descriptive test name? e.g.
   
   ```
   test("SPARK-37812: Reuse result row when deserializing row with struct type")
   ```

##########
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala
##########
@@ -844,6 +844,160 @@ abstract class OrcSourceSuite extends OrcSuite with 
SharedSparkSession {
       }
     }
   }
+
+  // Test the various cases during deserialization where we may or may not copy
+  // a struct writer's result row
+  test("struct in an array") {
+    withTempPath { file =>
+      val df = sql(
+        """SELECT
+          |  array(
+          |    named_struct(
+          |      'a1', 1,
+          |      'a2', 2),
+          |    named_struct(
+          |      'a1', 3,
+          |      'a2', 4)
+          |  ) as col1
+          |""".stripMargin)
+      df.write.orc(file.getCanonicalPath)
+      val df2 = spark.read.orc(file.getCanonicalPath)
+      checkAnswer(df2, df.collect().toSeq)

Review comment:
       Could you test both vectorized and non-vectorized code path? i.e. 
enabling and disabling `spark.sql.orc.enableNestedColumnVectorizedReader`? The 
current test case is working, because vectorized read of struct is disabled, so 
we are going through the non-vectorized code path (the deserialization code 
path we are fixing here). I am worried about missing this test, when we decide 
to enable vectorized reader for nested column in upcoming future Spark release. 
Thanks.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] c21 commented on a change in pull request #35090: [SPARK-37812][SQL] When deserializing an Orc struct, reuse the result row when possible

Reply via email to