Chen Zhang created SPARK-32639:
----------------------------------

             Summary: Support GroupType parquet mapkey field
                 Key: SPARK-32639
                 URL: https://issues.apache.org/jira/browse/SPARK-32639
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.0, 2.4.6
            Reporter: Chen Zhang


I have a Parquet file whose recorded MessageType is:
{code:java}
message parquet_schema {
  optional group value (MAP) {
    repeated group key_value {
      required group key {
        optional binary first (UTF8);
        optional binary middle (UTF8);
        optional binary last (UTF8);
      }
      optional binary value (UTF8);
    }
  }
}{code}
 

Reading the file with +spark.read.parquet("000.snappy.parquet")+ fails: Spark 
throws an exception while converting the Parquet MessageType to a Spark SQL 
StructType:
{code:java}
AssertionError(Map key type is expected to be a primitive type, but found...)
{code}
 

Reading the file with an explicit schema works, and Spark returns the correct 
result:
+spark.read.schema("value MAP<STRUCT<first:STRING, middle:STRING, last:STRING>, STRING>").parquet("000.snappy.parquet")+
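
For anyone hitting the same error, the explicit-schema workaround can be sketched in PySpark roughly as follows. The helper names (map_schema_ddl, read_with_explicit_schema) are made up for illustration, the file name is the one from this report, and spark is assumed to be an existing SparkSession. The only Spark call used is spark.read.schema(...).parquet(...), which accepts a DDL string.

```python
def map_schema_ddl():
    """Build the DDL string for the map column described in this report."""
    key_fields = ["first", "middle", "last"]
    # The map key is a struct of three string fields.
    key_struct = "STRUCT<" + ", ".join(f"{name}:STRING" for name in key_fields) + ">"
    return f"value MAP<{key_struct}, STRING>"


def read_with_explicit_schema(spark, path="000.snappy.parquet"):
    """Read the file with a user-supplied schema.

    `spark` is assumed to be an existing SparkSession; `path` defaults to
    the file name used in this report.
    """
    return spark.read.schema(map_schema_ddl()).parquet(path)
```

Supplying the schema up front is only a workaround: it requires knowing the key layout in advance.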

According to the Parquet format specification 
(https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps), 
a map key in the Parquet format does not need to be a primitive type.
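
To make the difference concrete, here is a toy model of the conversion in Python. It is not Spark's converter: Primitive, Group, convert and convert_map are hypothetical names, and the allow_group_key flag only illustrates the change in behavior requested here. With the flag off, a group key is rejected, as in the reported error. With it on, the key becomes a STRUCT.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Primitive:
    """Toy stand-in for a primitive Parquet field (hypothetical class)."""
    name: str
    sql_type: str = "STRING"


@dataclass
class Group:
    """Toy stand-in for a Parquet group field (hypothetical class)."""
    name: str
    fields: List[Union["Primitive", "Group"]]


def convert(node):
    """Convert a toy Parquet node into a SQL type string."""
    if isinstance(node, Primitive):
        return node.sql_type
    inner = ", ".join(f"{f.name}:{convert(f)}" for f in node.fields)
    return f"STRUCT<{inner}>"


def convert_map(key, value, allow_group_key=False):
    """Convert a map's key and value nodes into a MAP<...> type string."""
    if not allow_group_key and not isinstance(key, Primitive):
        # Models the strict check that rejects non-primitive keys.
        raise AssertionError("Map key type is expected to be a primitive type")
    return f"MAP<{convert(key)}, {convert(value)}>"


# The key and value from the schema in this report.
key = Group("key", [Primitive("first"), Primitive("middle"), Primitive("last")])
value = Primitive("value")
```

Calling convert_map(key, value) raises the assertion, while convert_map(key, value, allow_group_key=True) yields the same MAP type that was passed by hand in the explicit-schema read.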

 
Note: This Parquet file was not written by Spark. Files written by Spark carry 
an additional sparkSchema string in their metadata, and when reading such a 
file Spark uses that stored schema directly instead of converting the Parquet 
MessageType to a Spark SQL StructType, so they do not hit this code path.
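
A quick way to tell whether a given file takes that shortcut is to look for Spark's schema entry in the footer's key-value metadata. The sketch below assumes the metadata has already been loaded into a dict by some footer reader (for example pyarrow, whose keys come back as bytes). has_spark_schema is a hypothetical helper, and the key name is the one I believe Spark uses for its stored schema.

```python
# Footer key under which Spark is assumed to store its own schema.
SPARK_METADATA_KEY = "org.apache.spark.sql.parquet.row.metadata"


def has_spark_schema(key_value_metadata):
    """Return True if the footer metadata contains Spark's schema entry.

    Accepts a dict with str or bytes keys, or None when the file has no
    key-value metadata at all.
    """
    if not key_value_metadata:
        return False
    keys = {
        k.decode("utf-8") if isinstance(k, bytes) else k
        for k in key_value_metadata
    }
    return SPARK_METADATA_KEY in keys
```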


I will submit a PR later.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
