Kristin Cowalcijk created SEDONA-457:
----------------------------------------

             Summary: Don't write GeometryUDT into {{org.apache.spark.sql.parquet.row.metadata}} when writing GeoParquet files
                 Key: SEDONA-457
                 URL: https://issues.apache.org/jira/browse/SEDONA-457
             Project: Apache Sedona
          Issue Type: Improvement
            Reporter: Kristin Cowalcijk
             Fix For: 1.5.1


Spark SQL primarily uses {{org.apache.spark.sql.parquet.row.metadata}} to infer 
the schema of Parquet files. It falls back to the native Parquet schema only 
when {{org.apache.spark.sql.parquet.row.metadata}} is absent.
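For illustration, the value stored under that key is a JSON-serialized Spark schema. A minimal stdlib sketch of what a Sedona-written entry looks like once parsed (field names taken from the warning message below; this is not Sedona code, just a hand-built example of the JSON shape):

```python
import json

# JSON schema string as Spark stores it under
# "org.apache.spark.sql.parquet.row.metadata" in the Parquet footer.
# The "geom" field carries a UDT type *object* instead of a primitive
# type name, which is what older readers cannot parse.
row_metadata = json.dumps({
    "type": "struct",
    "fields": [
        {"name": "user_id", "type": "long", "nullable": True, "metadata": {}},
        {
            "name": "geom",
            "type": {
                "type": "udt",
                "class": "org.apache.spark.sql.sedona_sql.UDT.GeometryUDT",
                "pyClass": "sedona.sql.types.GeometryType",
                "sqlType": "binary",
            },
            "nullable": True,
            "metadata": {},
        },
    ],
})

schema = json.loads(row_metadata)
geom_type = schema["fields"][1]["type"]
print(geom_type["type"])     # the non-primitive "udt" marker
print(geom_type["sqlType"])  # the underlying storage type, "binary"
```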

Writing the schema of dataframes with GeometryUDT columns into 
{{org.apache.spark.sql.parquet.row.metadata}} may cause compatibility problems 
with older versions of Apache Sedona. Additionally, a warning is emitted when 
reading such GeoParquet files using vanilla Spark SQL:
{code:java}
>>> df = spark.read.format("parquet").load('/home/kontinuation/local/iceberg/test_geoparquet_points')
23/12/27 17:43:56 WARN ParquetFileFormat: Failed to parse and ignored serialized Spark schema in Parquet key-value metadata:
        {"type":"struct","fields":[{"name":"user_id","type":"long","nullable":true,"metadata":{}},{"name":"speed","type":"double","nullable":true,"metadata":{}},{"name":"geom","type":{"type":"udt","class":"org.apache.spark.sql.sedona_sql.UDT.GeometryUDT","pyClass":"sedona.sql.types.GeometryType","sqlType":"binary"},"nullable":true,"metadata":{}}]}
java.lang.IllegalArgumentException: Unsupported dataType: {"type":"struct","fields":[{"name":"user_id","type":"long","nullable":true,"metadata":{}},{"name":"speed","type":"double","nullable":true,"metadata":{}},{"name":"geom","type":{"type":"udt","class":"org.apache.spark.sql.sedona_sql.UDT.GeometryUDT","pyClass":"sedona.sql.types.GeometryType","sqlType":"binary"},"nullable":true,"metadata":{}}]}, [1.1] failure: 'TimestampType' expected but '{' found

{"type":"struct","fields":[{"name":"user_id","type":"long","nullable":true,"metadata":{}},{"name":"speed","type":"double","nullable":true,"metadata":{}},{"name":"geom","type":{"type":"udt","class":"org.apache.spark.sql.sedona_sql.UDT.GeometryUDT","pyClass":"sedona.sql.types.GeometryType","sqlType":"binary"},"nullable":true,"metadata":{}}]}
^
        at org.apache.spark.sql.catalyst.parser.LegacyTypeStringParser$.parseString(LegacyTypeStringParser.scala:90)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$deserializeSchemaString$2.applyOrElse(ParquetFileFormat.scala:521)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$deserializeSchemaString$2.applyOrElse(ParquetFileFormat.scala:516)
        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
        at scala.util.Failure.recover(Try.scala:234)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.deserializeSchemaString(ParquetFileFormat.scala:516)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readSchemaFromFooter$1(ParquetFileFormat.scala:509)
        at scala.Option.flatMap(Option.scala:271)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.readSchemaFromFooter(ParquetFileFormat.scala:509)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$2(ParquetFileFormat.scala:491)
        at scala.collection.immutable.Stream.map(Stream.scala:418)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1(ParquetFileFormat.scala:491)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1$adapted(ParquetFileFormat.scala:484)
        at org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$2(SchemaMergeUtils.scala:75)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:837)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:837)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:127)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:463)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:466)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
{code}
We suggest recording geometry columns as plain binary in the serialized Spark 
schema for maximum compatibility.
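As a rough stdlib sketch of the idea (not Sedona's actual implementation): before serializing the Spark schema into the footer, UDT type objects could be collapsed to their underlying {{sqlType}}, so the geometry column is recorded as plain binary that any Spark version can parse. The helper name {{collapse_udts}} is hypothetical:

```python
import json

def collapse_udts(schema_json: str) -> str:
    """Replace top-level UDT type objects in a Spark schema JSON string
    with their underlying sqlType, so older readers see a plain primitive
    type. Illustrative sketch only; nested structs are not handled."""
    schema = json.loads(schema_json)
    for field in schema["fields"]:
        t = field["type"]
        if isinstance(t, dict) and t.get("type") == "udt":
            field["type"] = t["sqlType"]  # e.g. GeometryUDT -> "binary"
    return json.dumps(schema)

# Schema fragment taken from the warning message above.
original = json.dumps({
    "type": "struct",
    "fields": [
        {"name": "speed", "type": "double", "nullable": True, "metadata": {}},
        {"name": "geom",
         "type": {"type": "udt",
                  "class": "org.apache.spark.sql.sedona_sql.UDT.GeometryUDT",
                  "pyClass": "sedona.sql.types.GeometryType",
                  "sqlType": "binary"},
         "nullable": True, "metadata": {}},
    ],
})

print(collapse_udts(original))
# The "geom" field now carries "type": "binary" instead of a UDT object.
```

A reader on the other side would then see an ordinary binary column and could rebuild geometries from the WKB payload as needed.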

Please refer to the discussion in 
[OvertureMaps/data/issues/89|https://github.com/OvertureMaps/data/issues/89#issuecomment-1820154760]
 for the background of this proposal.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
