izchen opened a new pull request #29451:
URL: https://github.com/apache/spark/pull/29451


   ### What changes were proposed in this pull request?
   Remove the assertion in ParquetSchemaConverter that requires the Parquet map key field to be a PrimitiveType.
   
   
   ### Why are the changes needed?
   A Parquet file is attached to [SPARK-32639](https://issues.apache.org/jira/browse/SPARK-32639). The MessageType recorded in the file is:
   ```
   message parquet_schema {
     optional group value (MAP) {
       repeated group key_value {
         required group key {
           optional binary first (UTF8);
           optional binary middle (UTF8);
           optional binary last (UTF8);
         }
         optional binary value (UTF8);
       }
     }
   }
   ```
   
   Reading the file with `spark.read.parquet("000.snappy.parquet")` throws an exception while Spark converts the Parquet MessageType to a Spark SQL StructType:
   
   > AssertionError(Map key type is expected to be a primitive type, but 
found...)
   
   When the schema is supplied explicitly, as in `spark.read.schema("value MAP<STRUCT<first:STRING, middle:STRING, last:STRING>, STRING>").parquet("000.snappy.parquet")`, Spark returns the correct result.
   
   According to the Parquet format specification (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps), a map key in the Parquet format does not need to be a primitive type.
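
To make the effect of the assertion concrete, here is a minimal, self-contained sketch in plain Python. It is not Spark's converter: `Primitive`, `Group`, `to_sql_type`, `convert_map` and the `require_primitive_key` flag are hypothetical names invented for illustration. The sketch only models a MAP conversion that either enforces or skips a primitive-key check, using the key and value fields from the schema above.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Primitive:
    """Hypothetical stand-in for a Parquet primitive field."""
    name: str
    sql_type: str  # e.g. "STRING"


@dataclass
class Group:
    """Hypothetical stand-in for a Parquet group field."""
    name: str
    fields: List["Field"]


Field = Union[Primitive, Group]


def to_sql_type(f: Field) -> str:
    """Render a field as a Spark SQL DDL-style type string."""
    if isinstance(f, Primitive):
        return f.sql_type
    inner = ", ".join(f"{c.name}:{to_sql_type(c)}" for c in f.fields)
    return f"STRUCT<{inner}>"


def convert_map(key: Field, value: Field, require_primitive_key: bool) -> str:
    """Convert the key/value pair of a MAP group to a DDL-style MAP type."""
    if require_primitive_key:
        # Models the kind of check this PR removes.
        assert isinstance(key, Primitive), (
            "Map key type is expected to be a primitive type"
        )
    return f"MAP<{to_sql_type(key)}, {to_sql_type(value)}>"


# The key and value fields from the MessageType shown earlier.
key = Group("key", [
    Primitive("first", "STRING"),
    Primitive("middle", "STRING"),
    Primitive("last", "STRING"),
])
value = Primitive("value", "STRING")

# Without the check, the nested key converts to a struct-keyed map.
print(convert_map(key, value, require_primitive_key=False))
```

With `require_primitive_key=True` the same call raises `AssertionError`, which mirrors the failure described above. With the check disabled the result matches the schema string that had to be passed by hand to `spark.read.schema(...)`.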
   
   Note: This Parquet file was not written by Spark. Spark writes an additional sparkSchema string into the Parquet files it produces, and when reading such a file it uses that stored sparkSchema directly instead of converting the Parquet MessageType to a Spark SQL StructType. Files written by Spark therefore never reach the assertion.
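
The read path described in the note can be sketched as follows. This is a simplified model, not Spark's implementation: `resolve_schema` and `convert_message_type` are hypothetical names, the metadata key is my understanding of the one Spark uses and should be checked against the Spark source, and the stored schema is assumed to be a JSON string.

```python
import json
from typing import Any, Callable, Dict

# Assumption: the file-metadata key under which Spark stores its schema.
SPARK_METADATA_KEY = "org.apache.spark.sql.parquet.row.metadata"


def resolve_schema(
    key_value_metadata: Dict[str, str],
    convert_message_type: Callable[[], Any],
) -> Any:
    """Prefer the stored Spark schema; otherwise convert the MessageType."""
    stored = key_value_metadata.get(SPARK_METADATA_KEY)
    if stored is not None:
        # File written by Spark: the converter is never invoked.
        return json.loads(stored)
    # File written by another tool: the conversion path (and any
    # assertion inside it) runs.
    return convert_message_type()
```

Only the second branch exercises the MessageType conversion, which is why the problem shows up with files produced by other writers.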
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Added a unit test case




