Kristin Cowalcijk created SEDONA-457:
----------------------------------------
             Summary: Don't write GeometryUDT into {{org.apache.spark.sql.parquet.row.metadata}} when writing GeoParquet files
                 Key: SEDONA-457
                 URL: https://issues.apache.org/jira/browse/SEDONA-457
             Project: Apache Sedona
          Issue Type: Improvement
            Reporter: Kristin Cowalcijk
             Fix For: 1.5.1


Spark SQL primarily uses {{org.apache.spark.sql.parquet.row.metadata}} to infer the schema of Parquet files, and falls back to the native Parquet schema only when {{org.apache.spark.sql.parquet.row.metadata}} is absent. Writing the schema of DataFrames with GeometryUDT columns into {{org.apache.spark.sql.parquet.row.metadata}} may cause compatibility problems with older versions of Apache Sedona. Additionally, vanilla Spark SQL emits a warning when reading such GeoParquet files:

{code:java}
>>> df = spark.read.format("parquet").load('/home/kontinuation/local/iceberg/test_geoparquet_points')
23/12/27 17:43:56 WARN ParquetFileFormat: Failed to parse and ignored serialized Spark schema in Parquet key-value metadata: {"type":"struct","fields":[{"name":"user_id","type":"long","nullable":true,"metadata":{}},{"name":"speed","type":"double","nullable":true,"metadata":{}},{"name":"geom","type":{"type":"udt","class":"org.apache.spark.sql.sedona_sql.UDT.GeometryUDT","pyClass":"sedona.sql.types.GeometryType","sqlType":"binary"},"nullable":true,"metadata":{}}]}
java.lang.IllegalArgumentException: Unsupported dataType: {"type":"struct","fields":[{"name":"user_id","type":"long","nullable":true,"metadata":{}},{"name":"speed","type":"double","nullable":true,"metadata":{}},{"name":"geom","type":{"type":"udt","class":"org.apache.spark.sql.sedona_sql.UDT.GeometryUDT","pyClass":"sedona.sql.types.GeometryType","sqlType":"binary"},"nullable":true,"metadata":{}}]}, [1.1] failure: 'TimestampType' expected but '{' found

{"type":"struct","fields":[{"name":"user_id","type":"long","nullable":true,"metadata":{}},{"name":"speed","type":"double","nullable":true,"metadata":{}},{"name":"geom","type":{"type":"udt","class":"org.apache.spark.sql.sedona_sql.UDT.GeometryUDT","pyClass":"sedona.sql.types.GeometryType","sqlType":"binary"},"nullable":true,"metadata":{}}]}
^
        at org.apache.spark.sql.catalyst.parser.LegacyTypeStringParser$.parseString(LegacyTypeStringParser.scala:90)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$deserializeSchemaString$2.applyOrElse(ParquetFileFormat.scala:521)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$deserializeSchemaString$2.applyOrElse(ParquetFileFormat.scala:516)
        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
        at scala.util.Failure.recover(Try.scala:234)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.deserializeSchemaString(ParquetFileFormat.scala:516)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readSchemaFromFooter$1(ParquetFileFormat.scala:509)
        at scala.Option.flatMap(Option.scala:271)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.readSchemaFromFooter(ParquetFileFormat.scala:509)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$2(ParquetFileFormat.scala:491)
        at scala.collection.immutable.Stream.map(Stream.scala:418)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1(ParquetFileFormat.scala:491)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1$adapted(ParquetFileFormat.scala:484)
        at org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$2(SchemaMergeUtils.scala:75)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:837)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:837)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:127)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:463)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:466)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
{code}

We suggest changing the data types of geometry columns to binary in the serialized schema for maximum compatibility. Please refer to the discussion in [OvertureMaps/data/issues/89|https://github.com/OvertureMaps/data/issues/89#issuecomment-1820154760] for the background of this proposal.
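For illustration only, the following sketch shows one way the schema written to {{org.apache.spark.sql.parquet.row.metadata}} could be derived by replacing GeometryUDT fields with plain binary before serialization. The helper name is hypothetical and this is not the actual GeoParquet writer code:

{code:python}
from pyspark.sql.types import StructType, StructField, BinaryType
from sedona.sql.types import GeometryType

def schema_for_parquet_metadata(schema: StructType) -> StructType:
    # Hypothetical helper: replace GeometryUDT fields with plain BinaryType so
    # that the JSON written into org.apache.spark.sql.parquet.row.metadata only
    # contains types that vanilla Spark SQL (and older Sedona versions) can parse.
    # (Top-level fields only; nested geometry fields would need recursive handling.)
    fields = [
        StructField(f.name, BinaryType(), f.nullable, f.metadata)
        if isinstance(f.dataType, GeometryType)
        else f
        for f in schema.fields
    ]
    return StructType(fields)

# Example: print the JSON that would go into the Parquet key-value metadata;
# the "geom" column is declared as "binary" instead of the GeometryUDT JSON.
# print(schema_for_parquet_metadata(df.schema).json())
{code}

With a schema like this in the key-value metadata, vanilla Spark would read the geometry column back as plain binary without the warning above, while Sedona's GeoParquet reader could presumably still restore the geometry type from the GeoParquet file metadata.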