sadikovi commented on code in PR #37747:
URL: https://github.com/apache/spark/pull/37747#discussion_r962471116


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala:
##########
@@ -59,6 +59,21 @@ class ParquetFilters(
   // nested columns. If any part of the names contains `dots`, it is quoted to avoid confusion.
   // See `org.apache.spark.sql.connector.catalog.quote` for implementation details.
   private val nameToParquetField : Map[String, ParquetPrimitiveField] = {
+    def getNormalizedLogicalType(p: PrimitiveType): LogicalTypeAnnotation = {
+      // signed 64 bits on an INT64 and signed 32 bits on an INT32 are optional, but the rest of
+      // the code here assumes they are not set, so normalize them to not being set. SPARK-40280
+      p.getLogicalTypeAnnotation match {
+        case la : IntLogicalTypeAnnotation if la.isSigned &&

Review Comment:
   nit: `case la: IntLogicalTypeAnnotation`



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala:
##########
@@ -370,6 +369,38 @@ abstract class ParquetFilterSuite extends QueryTest with ParquetTest with Shared
     }
   }
 
+  test("filter pushdown - int SPARK-40280") {

Review Comment:
   +1



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala:
##########
@@ -59,6 +59,21 @@ class ParquetFilters(
   // nested columns. If any part of the names contains `dots`, it is quoted to avoid confusion.
   // See `org.apache.spark.sql.connector.catalog.quote` for implementation details.
   private val nameToParquetField : Map[String, ParquetPrimitiveField] = {
+    def getNormalizedLogicalType(p: PrimitiveType): LogicalTypeAnnotation = {
+      // signed 64 bits on an INT64 and signed 32 bits on an INT32 are optional, but the rest of
+      // the code here assumes they are not set, so normalize them to not being set. SPARK-40280
+      p.getLogicalTypeAnnotation match {
+        case la : IntLogicalTypeAnnotation if la.isSigned &&
+            la.getBitWidth == 32 && p.getPrimitiveTypeName == PrimitiveTypeName.INT32 =>
+          null
+        case la : IntLogicalTypeAnnotation if la.isSigned &&

Review Comment:
   nit: `case la: IntLogicalTypeAnnotation`



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala:
##########
@@ -59,6 +59,21 @@ class ParquetFilters(
   // nested columns. If any part of the names contains `dots`, it is quoted to avoid confusion.
   // See `org.apache.spark.sql.connector.catalog.quote` for implementation details.
   private val nameToParquetField : Map[String, ParquetPrimitiveField] = {
+    def getNormalizedLogicalType(p: PrimitiveType): LogicalTypeAnnotation = {
+      // signed 64 bits on an INT64 and signed 32 bits on an INT32 are optional, but the rest of
+      // the code here assumes they are not set, so normalize them to not being set. SPARK-40280
+      p.getLogicalTypeAnnotation match {
+        case la : IntLogicalTypeAnnotation if la.isSigned &&

Review Comment:
   Also, what does "la" stand for?



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala:
##########
@@ -59,6 +59,21 @@ class ParquetFilters(
   // nested columns. If any part of the names contains `dots`, it is quoted to avoid confusion.
   // See `org.apache.spark.sql.connector.catalog.quote` for implementation details.
   private val nameToParquetField : Map[String, ParquetPrimitiveField] = {
+    def getNormalizedLogicalType(p: PrimitiveType): LogicalTypeAnnotation = {
+      // signed 64 bits on an INT64 and signed 32 bits on an INT32 are optional, but the rest of

Review Comment:
   nit: `SPARK-40280: Signed 64 ...`.



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala:
##########
@@ -59,6 +59,21 @@ class ParquetFilters(
   // nested columns. If any part of the names contains `dots`, it is quoted to avoid confusion.
   // See `org.apache.spark.sql.connector.catalog.quote` for implementation details.
   private val nameToParquetField : Map[String, ParquetPrimitiveField] = {
+    def getNormalizedLogicalType(p: PrimitiveType): LogicalTypeAnnotation = {
+      // signed 64 bits on an INT64 and signed 32 bits on an INT32 are optional, but the rest of
+      // the code here assumes they are not set, so normalize them to not being set. SPARK-40280

Review Comment:
   nit: Remove `SPARK-40280` at the end.



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala:
##########
@@ -59,6 +59,21 @@ class ParquetFilters(
   // nested columns. If any part of the names contains `dots`, it is quoted to avoid confusion.
   // See `org.apache.spark.sql.connector.catalog.quote` for implementation details.
   private val nameToParquetField : Map[String, ParquetPrimitiveField] = {
+    def getNormalizedLogicalType(p: PrimitiveType): LogicalTypeAnnotation = {
+      // signed 64 bits on an INT64 and signed 32 bits on an INT32 are optional, but the rest of
+      // the code here assumes they are not set, so normalize them to not being set. SPARK-40280
+      p.getLogicalTypeAnnotation match {
+        case la : IntLogicalTypeAnnotation if la.isSigned &&
+            la.getBitWidth == 32 && p.getPrimitiveTypeName == PrimitiveTypeName.INT32 =>
+          null
+        case la : IntLogicalTypeAnnotation if la.isSigned &&
+            la.getBitWidth == 64 && p.getPrimitiveTypeName == PrimitiveTypeName.INT64 =>

Review Comment:
   I would suggest rewriting the match statement like this, IMHO it is cleaner:
   ```scala
   (p.getPrimitiveTypeName, p.getLogicalTypeAnnotation) match {
     case (INT32, intType: IntLogicalTypeAnnotation)
       if intType.getBitWidth() == 32 && intType.isSigned() => null
     case (INT64, intType: IntLogicalTypeAnnotation)
       if intType.getBitWidth() == 64 && intType.isSigned() => null
     case (_, otherType) => otherType
   }
   ```
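For reference, here is a minimal self-contained sketch of the normalization helper with the suggested pattern applied, assuming the parquet-mr `PrimitiveType`/`LogicalTypeAnnotation` API used in the snippets above. The quoted diffs are truncated at the commented lines, so this is an illustrative reconstruction, not the PR's exact code:

```scala
import org.apache.parquet.schema.{LogicalTypeAnnotation, PrimitiveType}
import org.apache.parquet.schema.LogicalTypeAnnotation.IntLogicalTypeAnnotation
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.{INT32, INT64}

// Drop signed int32/int64 annotations that are redundant with the physical type,
// so downstream matching can treat such columns as having no logical type annotation.
def getNormalizedLogicalType(p: PrimitiveType): LogicalTypeAnnotation = {
  (p.getPrimitiveTypeName, p.getLogicalTypeAnnotation) match {
    // A signed 32-bit annotation on a physical INT32 carries no extra information.
    case (INT32, intType: IntLogicalTypeAnnotation)
        if intType.getBitWidth == 32 && intType.isSigned => null
    // Likewise for a signed 64-bit annotation on a physical INT64.
    case (INT64, intType: IntLogicalTypeAnnotation)
        if intType.getBitWidth == 64 && intType.isSigned => null
    // Anything else (including a missing annotation, i.e. null) passes through unchanged.
    case (_, otherType) => otherType
  }
}
```

Returning `null` matches what `getLogicalTypeAnnotation` yields for an unannotated primitive, which is the intent stated in the diff comment: the rest of the pushdown code assumes these redundant annotations are simply not set.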


