Re: [I] [Python] Reading parquet file generated by pyspark throw `Logical type Null can not be applied to group node` [arrow]

via GitHub Thu, 18 Jan 2024 00:11:52 -0800


jorisvandenbossche commented on issue #39540:
URL: https://github.com/apache/arrow/issues/39540#issuecomment-1897992749


   Yes, I can confirm the error, also with the latest development version of 
Arrow. 
   
   Opening the file and printing the schema there:
   
   ```
   In [63]: import fastparquet
   
   In [64]: f = fastparquet.ParquetFile("../Downloads/sample/part-00000-e69412f4-236c-436a-a4cd-89318d2aaa3d-c000.snappy.parquet")
   
   In [65]: print(f.schema.text)
   - spark_schema: 
   | - id: BYTE_ARRAY, STRING, UTF8, OPTIONAL
   | - email_sha256: BYTE_ARRAY, STRING, UTF8, OPTIONAL
   | - params: MAP, MAP, OPTIONAL
   |   - map: UNKNOWN, MAP_KEY_VALUE, REPEATED
   |   | - key: BYTE_ARRAY, STRING, UTF8, REQUIRED
   |     - value: BYTE_ARRAY, STRING, UTF8, OPTIONAL
     - master_id: BYTE_ARRAY, STRING, UTF8, OPTIONAL
   ```
   
   I suppose the error is coming from the `MAP` type column "params" (since the 
other columns are simple, non-nested columns). 
   
   According to the spec 
(https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps), 
map columns consist of a 3-level structure, as is also shown above. The second 
(middle) level "named `key_value`, must be a repeated group". 
   So to start with, it seems that Spark doesn't use the name given in the spec, 
since in the schema above that level is named "map" and not "key_value". 
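
   To make the comparison with the spec concrete, here is a minimal sketch that 
checks a map column against the 3-level layout. It works on a hand-built model 
of the schema printed above, not on the file itself; the `Node` class and the 
`map_layout_deviations` helper are hypothetical names introduced for this 
illustration and are not part of Arrow or fastparquet.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for one node of a Parquet schema tree."""
    name: str
    repetition: str  # "REQUIRED", "OPTIONAL" or "REPEATED"
    children: list = field(default_factory=list)  # empty for leaf columns


def map_layout_deviations(outer):
    """List the ways a MAP column deviates from the 3-level layout in the spec."""
    if len(outer.children) != 1:
        return [f"{outer.name}: expected exactly one middle group"]
    middle = outer.children[0]
    problems = []
    if middle.repetition != "REPEATED":
        problems.append("middle group is not repeated")
    if middle.name != "key_value":
        problems.append(
            f"middle group is named '{middle.name}', spec says 'key_value'"
        )
    first = middle.children[0] if middle.children else None
    if first is None or first.name != "key" or first.repetition != "REQUIRED":
        problems.append("first child of the middle group should be a required 'key'")
    return problems


# The "params" column, transcribed by hand from the schema dump above.
params = Node("params", "OPTIONAL", [
    Node("map", "REPEATED", [
        Node("key", "REQUIRED"),
        Node("value", "OPTIONAL"),
    ]),
])

print(map_layout_deviations(params))
```

   For the column above, the only deviation this sketch reports is the name of 
the middle group.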
   
   But I assume the issue is that this middle level is annotated with the 
"UNKNOWN" logical type. And in our code, we don't allow a group node to have 
such a logical type (to my understanding, this annotation also doesn't make 
sense here, as it should be used to indicate that all values in that column are 
null). 
   
   Looking at the parquet-mr repo, it seems this was fixed in version 1.12 
(https://github.com/apache/parquet-mr/pull/798 / 
https://issues.apache.org/jira/browse/PARQUET-1879). So if you update your 
Spark (and parquet-mr) version and write the file again, I assume it will 
be readable by Arrow.
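
   One way to tell whether a given file predates that fix is to look at the 
`created_by` string in the Parquet footer. The sketch below only parses such a 
string; how you obtain it from the file is left out. It assumes the writer 
identifies itself as `parquet-mr version X.Y.Z (build ...)`, and the helper 
name and the sample strings are made up for illustration (they are not taken 
from the file in this issue).

```python
import re


def written_by_old_parquet_mr(created_by):
    """Classify a footer created_by string (hypothetical helper).

    Returns True for parquet-mr older than 1.12, False for 1.12 or newer,
    and None if the string does not look like a parquet-mr writer.
    """
    match = re.match(r"parquet-mr version (\d+)\.(\d+)", created_by or "")
    if match is None:
        return None
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) < (1, 12)


# Illustrative inputs only.
for sample in (
    "parquet-mr version 1.10.1 (build abc)",
    "parquet-mr version 1.12.0 (build abc)",
    "some-other-writer version 3.0",
):
    print(sample, "->", written_by_old_parquet_mr(sample))
```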
   
   Now, this still means that we cannot read files written by parquet-mr < 
1.12. I assume also on the Arrow side it should be possible to add some 
workaround to ignore the UNKNOWN logical type of a group node _if_ the 
converted type is MAP_KEY_VALUE.
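
   As a rough sketch of what such a workaround could look like, written as a 
plain-Python model rather than as the actual Arrow code: the function name and 
the use of strings for the type annotations are assumptions made for this 
illustration, and only the error message is taken from the issue title.

```python
def effective_group_logical_type(logical_type, converted_type):
    """Decide which logical type to keep for a group node (hypothetical helper).

    UNKNOWN on a group node is dropped (None is returned) when the legacy
    converted type is MAP_KEY_VALUE, which is what parquet-mr < 1.12 wrote.
    UNKNOWN on any other group node is still rejected.
    """
    if logical_type == "UNKNOWN":
        if converted_type == "MAP_KEY_VALUE":
            return None  # ignore the bogus annotation, keep reading
        raise ValueError("Logical type Null can not be applied to group node")
    return logical_type


# The middle level of "params" from the schema above: UNKNOWN + MAP_KEY_VALUE.
print(effective_group_logical_type("UNKNOWN", "MAP_KEY_VALUE"))
# The outer level of "params": MAP + MAP, passed through unchanged.
print(effective_group_logical_type("MAP", "MAP"))
```

   Keeping the rejection for every other group node means the existing check 
stays in place and only the one pattern known to come from older parquet-mr 
files is let through.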
   

