sunchao commented on a change in pull request #33888:
URL: https://github.com/apache/spark/pull/33888#discussion_r700366823



##########
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -903,6 +903,13 @@ object SQLConf {
     .intConf
     .createWithDefault(4096)
 
+  val PARQUET_COLUMN_INDEX_ACCESS = 
buildConf("spark.sql.parquet.columnIndexAccess")

Review comment:
       Column index is a special feature in Parquet and this may cause 
confusion. How about `spark.sql.parquet.accessByIndex`?

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
##########
@@ -4223,6 +4223,52 @@ class SQLQuerySuite extends QueryTest with 
SharedSparkSession with AdaptiveSpark
     checkAnswer(sql("""SELECT from_json(r'{"a": "\\"}', 'a string')"""), 
Row(Row("\\")))
     checkAnswer(sql("""SELECT from_json(R'{"a": "\\"}', 'a string')"""), 
Row(Row("\\")))
   }
+
+  test("SPARK-36634: Support access and read parquet file by column index") {
+    withTempDir { dir =>

Review comment:
      We should test with `spark.sql.parquet.enableVectorizedReader` both 
enabled and disabled. I think it currently doesn't work for the latter.

##########
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -903,6 +903,13 @@ object SQLConf {
     .intConf
     .createWithDefault(4096)
 
+  val PARQUET_COLUMN_INDEX_ACCESS = 
buildConf("spark.sql.parquet.columnIndexAccess")
+    .doc("When true, we access the parquet files by column index instead of 
catalyst schema" +

Review comment:
       nit: parquet -> Parquet

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
##########
@@ -4223,6 +4223,52 @@ class SQLQuerySuite extends QueryTest with 
SharedSparkSession with AdaptiveSpark
     checkAnswer(sql("""SELECT from_json(r'{"a": "\\"}', 'a string')"""), 
Row(Row("\\")))
     checkAnswer(sql("""SELECT from_json(R'{"a": "\\"}', 'a string')"""), 
Row(Row("\\")))
   }
+
+  test("SPARK-36634: Support access and read parquet file by column index") {
+    withTempDir { dir =>
+      val loc = s"file:///$dir/t"
+
+      withTable("t1", "t2", "t3") {
+        sql(s"create table t1 (my_id int, my_name string) using parquet 
location '$loc'")
+        sql(s"create table t2 (myid int, myName string) using parquet location 
'$loc'")
+        sql("insert into t1 select 1, 'kent'")
+        sql("insert into t2 select 2, 'yao'")
+        sql("insert into t2 select 3, 'kyuubi'")

Review comment:
      nit: I'd suggest using more neutral words here :)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to