james-willis opened a new issue, #2704:
URL: https://github.com/apache/sedona/issues/2704

   ## Expected behavior
   
   On Spark 4.1+, the `TransformNestedUDTParquet` optimizer rule should not be 
registered, since the root cause it works around (SPARK-48942) has been fixed 
natively by [SPARK-52651](https://issues.apache.org/jira/browse/SPARK-52651).
   
   ## Actual behavior
   
   The `TransformNestedUDTParquet` rule is registered unconditionally on all 
Spark versions, including 4.1+, where it is redundant. This is not a crash bug, 
but the rule still rewrites plan output attributes on versions where Spark's 
vectorized Parquet reader already handles nested UDTs natively.
   
   ## Steps to reproduce the problem
   
   1. Run Sedona on Spark 4.1+
   2. Read a GeoParquet file with nested geometry columns (e.g., an array of 
structs containing `GeometryUDT`)
   3. Observe that `TransformNestedUDTParquet` still transforms the schema even 
though Spark 4.1+ already handles nested UDTs natively
   
   ## Settings
   
   Sedona version = 1.8.x / master
   
   Apache Spark version = 4.1+
   
   API type = Scala
   
   ## Context
   
   - PR #2359 introduced `TransformNestedUDTParquet` to work around 
SPARK-48942, which caused the vectorized Parquet reader to crash on nested UDTs.
   - [SPARK-52651](https://issues.apache.org/jira/browse/SPARK-52651) (merged 
in Spark 4.1) fixes this at the Spark level by recursively stripping UDTs in 
`ColumnVector`.
   - The workaround should be version-gated to only run on Spark < 4.1.
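
A minimal sketch of the version gate proposed above, assuming the decision can be made from the Spark version string alone. The object `NestedUdtWorkaroundGate` and the method `needsWorkaround` are hypothetical names for illustration and are not part of Sedona's code base.

```scala
// Hypothetical helper (illustration only, not Sedona's actual code).
object NestedUdtWorkaroundGate {
  // Captures the leading "major.minor" of strings such as "4.1.0" or "3.5.1-SNAPSHOT".
  private val MajorMinor = """^(\d+)\.(\d+).*""".r

  /** True when the given Spark version predates the SPARK-52651 fix, i.e. is < 4.1. */
  def needsWorkaround(sparkVersion: String): Boolean = sparkVersion match {
    case MajorMinor(major, minor) =>
      val (maj, min) = (major.toInt, minor.toInt)
      maj < 4 || (maj == 4 && min < 1)
    case _ =>
      // Unparseable version string: keep the workaround enabled,
      // which matches the current unconditional behaviour.
      true
  }
}
```

The code that registers `TransformNestedUDTParquet` could then be wrapped in a check such as `if (NestedUdtWorkaroundGate.needsWorkaround(org.apache.spark.SPARK_VERSION))`, so the rule is only injected on Spark versions below 4.1.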


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

