Copilot commented on code in PR #2359:
URL: https://github.com/apache/sedona/pull/2359#discussion_r2363834807


##########
spark/common/src/main/java/org/apache/spark/sql/execution/datasources/geoparquet/internal/ParquetColumnVector.java:
##########
@@ -31,11 +33,64 @@
 import org.apache.spark.sql.types.DataTypes;
 import org.apache.spark.sql.types.MapType;
 import org.apache.spark.sql.types.StructType;
+import org.apache.spark.sql.types.UserDefinedType;
 
 /**
 * Contains necessary information representing a Parquet column, either of primitive or nested type.
  */
 final class ParquetColumnVector {
+  /** Cache for type compatibility checks to avoid redundant recursive computations */
+  private static final Map<TypePair, Boolean> TYPE_COMPATIBILITY_CACHE = new ConcurrentHashMap<>();
+
+  /** Key class for caching type pairs */
+  private static class TypePair {
+    private final DataType type1;
+    private final DataType type2;
+    private final int cachedHashCode;
+
+    TypePair(DataType type1, DataType type2) {
+      this.type1 = type1;
+      this.type2 = type2;
+      // Pre-compute hash code for efficiency

Review Comment:
   [nitpick] The comment mentions efficiency but doesn't explain why pre-computing is more efficient. Consider clarifying that this avoids repeated hash code calculations during map operations.
   ```suggestion
         // Pre-compute hash code to avoid repeated hash code calculations during map operations,
         // which improves efficiency when TypePair is used as a key in hash-based collections.
   ```
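
For context, the pattern under discussion looks roughly like the sketch below. The `equals`/`hashCode` bodies are assumptions, since the hunk above is truncated before them; the point is that `ConcurrentHashMap` calls `hashCode()` on every lookup and insertion, and hashing a nested `DataType` walks the whole type tree, so computing it once in the constructor pays off.

```java
import java.util.Objects;
import org.apache.spark.sql.types.DataType;

/** Minimal sketch of the cache-key pattern; not the exact PR code. */
final class TypePair {
  private final DataType type1;
  private final DataType type2;
  private final int cachedHashCode;

  TypePair(DataType type1, DataType type2) {
    this.type1 = type1;
    this.type2 = type2;
    // ConcurrentHashMap calls hashCode() on every get/put; hash the two
    // (potentially deep) DataType trees exactly once, here.
    this.cachedHashCode = Objects.hash(type1, type2);
  }

  @Override
  public int hashCode() {
    return cachedHashCode; // O(1), no re-hashing of the type trees
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) return true;
    if (!(o instanceof TypePair)) return false;
    TypePair that = (TypePair) o;
    return type1.equals(that.type1) && type2.equals(that.type2);
  }
}
```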



##########
spark/common/src/main/java/org/apache/spark/sql/execution/datasources/geoparquet/internal/ParquetColumnVector.java:
##########
@@ -405,4 +460,94 @@ private int getCollectionSize(int maxRepetitionLevel, int idx) {
     }
     return size;
   }
+
+  /**
+   * Checks if two data types are compatible for Parquet column vector operations. This method
+   * handles the special case where one type is a UserDefinedType (UDT) and the other is the UDT's
+   * sqlType, which should be considered compatible.
+   *
+   * <p>This fixes SPARK-48942 where nested GeometryUDT fields fail to read from Parquet due to type
+   * mismatch between the logical schema (GeometryUDT) and physical schema (BinaryType).
+   *
+   * @param type1 First data type to compare
+   * @param type2 Second data type to compare
+   * @return true if the types are compatible, false otherwise
+   */
+  private static boolean isCompatibleType(DataType type1, DataType type2) {
+    // Fast path: For most regular cases, types are exactly the same
+    // This avoids cache overhead for simple comparisons

Review Comment:
   [nitpick] The comment could be more specific about what constitutes 'cache overhead'. Consider explaining that it avoids the cost of TypePair object creation and map operations for identical types.
   ```suggestion
       // Fast path: For most regular cases, types are exactly the same.
       // This avoids the overhead of TypePair object creation and cache map operations
       // (lookup and insertion) for simple comparisons where types are identical.
   ```
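
Concretely, the fast path sits in front of the memoized slow path, so identical types never allocate a `TypePair` or touch the cache. A minimal sketch, reusing the `TypePair` sketch above; the `checkCompatibility` helper is hypothetical, since the hunk does not show the recursive body, and only illustrates the UDT-to-sqlType unwrapping the javadoc describes:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.UserDefinedType;

class CompatSketch {
  private static final Map<TypePair, Boolean> TYPE_COMPATIBILITY_CACHE = new ConcurrentHashMap<>();

  static boolean isCompatibleType(DataType type1, DataType type2) {
    // Fast path: identical types need no TypePair allocation and no map access.
    if (type1.equals(type2)) {
      return true;
    }
    // Slow path: memoize so each distinct pair is computed at most once.
    // Plain get/put (not computeIfAbsent) because the check recurses into
    // this method, and the race of computing a pair twice is benign.
    TypePair key = new TypePair(type1, type2);
    Boolean cached = TYPE_COMPATIBILITY_CACHE.get(key);
    if (cached != null) {
      return cached;
    }
    boolean result = checkCompatibility(type1, type2);
    TYPE_COMPATIBILITY_CACHE.put(key, result);
    return result;
  }

  // Hypothetical helper: unwraps a UDT to its sqlType, which covers the
  // GeometryUDT-vs-BinaryType case the javadoc describes (SPARK-48942).
  private static boolean checkCompatibility(DataType t1, DataType t2) {
    if (t1 instanceof UserDefinedType) {
      return isCompatibleType(((UserDefinedType<?>) t1).sqlType(), t2);
    }
    if (t2 instanceof UserDefinedType) {
      return isCompatibleType(t1, ((UserDefinedType<?>) t2).sqlType());
    }
    return false;
  }
}
```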



##########
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/UDT/TransformNestedUDTParquet.scala:
##########
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.spark.sql.sedona_sql.UDT
+
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.expressions.AttributeReference
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
+import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, LogicalRelation}
+import org.apache.spark.sql.types._
+
+/**
+ * Catalyst rule that automatically transforms schemas with nested GeometryUDT to prevent
+ * SPARK-48942 errors in Parquet reading.
+ *
+ * This rule detects LogicalRelations that use ParquetFileFormat and have nested GeometryUDT in
+ * their schema, then transforms the schema to use BinaryType instead.
+ */
+class TransformNestedUDTParquet(spark: SparkSession) extends Rule[LogicalPlan] {
+
+  override def apply(plan: LogicalPlan): LogicalPlan = {
+    plan.transformUp {
+      case lr: LogicalRelation
+          if lr.relation.isInstanceOf[HadoopFsRelation] &&
+            lr.relation
+              .asInstanceOf[HadoopFsRelation]
+              .fileFormat
+              .isInstanceOf[ParquetFileFormat] &&
+            hasNestedGeometryUDT(lr.schema) =>
+        val relation = lr.relation.asInstanceOf[HadoopFsRelation]
+
+        // Transform the schema to use BinaryType for nested GeometryUDT
+        val transformedSchema = transformSchemaForNestedUDT(lr.schema)
+
+        // Create new AttributeReferences with transformed data types
+        val transformedAttributes = transformedSchema.fields.zipWithIndex.map {
+          case (field, index) =>
+            val originalAttr = lr.output(index)
+            AttributeReference(field.name, field.dataType, field.nullable, field.metadata)(
+              originalAttr.exprId,
+              originalAttr.qualifier)
+        }
+        lr.copy(output = transformedAttributes)
+
+      case other => other
+    }
+  }
+
+  /**
+   * Check if a schema contains nested GeometryUDT (i.e., GeometryUDT inside arrays, maps, or
+   * structs). Top-level GeometryUDT fields are NOT considered nested.

Review Comment:
   [nitpick] The documentation should clarify what 'Top-level GeometryUDT fields are NOT considered nested' means in the context of the method's return value and why this distinction matters for the SPARK-48942 fix.
   ```suggestion
      * Checks if a schema contains nested GeometryUDT fields, meaning GeometryUDT instances
      * that are inside arrays, maps, or structs (i.e., not at the top level of the schema).
      * Top-level GeometryUDT fields (fields whose type is GeometryUDT directly in the StructType)
      * are NOT considered nested and do not trigger the transformation. This distinction is important
      * because the SPARK-48942 Parquet bug only affects nested UDTs, not top-level ones. Therefore,
      * this method returns true only if a GeometryUDT is found inside a container type, ensuring
      * that only affected fields are transformed.
   ```
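
The distinction the suggestion spells out is easiest to see in code. Below is a sketch of the equivalent check, written in Java and generalized to any `UserDefinedType` rather than `GeometryUDT` specifically (the actual method in the PR is Scala and checks for GeometryUDT). The `insideContainer` flag is what separates a top-level UDT field, which reads fine, from one nested in a struct, array, or map, which hits SPARK-48942:

```java
import org.apache.spark.sql.types.ArrayType;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.MapType;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.types.UserDefinedType;

class NestedUdtSketch {
  /** True only if a UDT occurs inside a container type, never at the top level. */
  static boolean hasNestedUDT(StructType schema) {
    for (StructField field : schema.fields()) {
      // Top-level fields start outside any container, so a field whose type
      // is the UDT itself does not count as nested.
      if (containsUDT(field.dataType(), false)) {
        return true;
      }
    }
    return false;
  }

  private static boolean containsUDT(DataType dt, boolean insideContainer) {
    if (dt instanceof UserDefinedType) {
      return insideContainer; // nested only if we already descended into a container
    } else if (dt instanceof StructType) {
      for (StructField f : ((StructType) dt).fields()) {
        if (containsUDT(f.dataType(), true)) { // members of an inner struct are nested
          return true;
        }
      }
      return false;
    } else if (dt instanceof ArrayType) {
      return containsUDT(((ArrayType) dt).elementType(), true);
    } else if (dt instanceof MapType) {
      MapType mt = (MapType) dt;
      return containsUDT(mt.keyType(), true) || containsUDT(mt.valueType(), true);
    }
    return false;
  }
}
```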



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
