jorisvandenbossche commented on issue #39540:
URL: https://github.com/apache/arrow/issues/39540#issuecomment-1897992749
Yes, I can confirm the error, also with the latest development version of
Arrow.
Opening the file with fastparquet and printing the schema:
```
In [63]: import fastparquet
In [64]: f = fastparquet.ParquetFile("../Downloads/sample/part-00000-e69412f4-236c-436a-a4cd-89318d2aaa3d-c000.snappy.parquet")
In [65]: print(f.schema.text)
- spark_schema:
| - id: BYTE_ARRAY, STRING, UTF8, OPTIONAL
| - email_sha256: BYTE_ARRAY, STRING, UTF8, OPTIONAL
| - params: MAP, MAP, OPTIONAL
| - map: UNKNOWN, MAP_KEY_VALUE, REPEATED
| | - key: BYTE_ARRAY, STRING, UTF8, REQUIRED
| - value: BYTE_ARRAY, STRING, UTF8, OPTIONAL
- master_id: BYTE_ARRAY, STRING, UTF8, OPTIONAL
```
I suppose the error is coming from the `MAP` type column "params" (since the
other columns are simple, non-nested columns).
According to the spec
(https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps),
map columns consist of a 3-level structure, as is also shown above. The second
(middle) level, "named `key_value`, must be a repeated group".
So as a start, it seems that Spark doesn't use the correct name, since above
that level is named "map" and not "key_value".
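
To make the layout the spec expects concrete, here is a minimal Python sketch that models the "params" column from the schema above and checks its middle level. The `Node` class and the `check_map_middle_level` helper are hypothetical names written only for this illustration; they are not part of fastparquet or Arrow.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    # Hypothetical, simplified stand-in for a Parquet schema node.
    name: str
    repetition: str  # "REQUIRED", "OPTIONAL" or "REPEATED"
    children: List["Node"] = field(default_factory=list)


def check_map_middle_level(map_node: Node) -> List[str]:
    """Return the deviations from the spec's 3-level MAP layout."""
    if len(map_node.children) != 1:
        return ["MAP group must have exactly one child"]
    problems = []
    middle = map_node.children[0]
    if middle.repetition != "REPEATED":
        problems.append("middle level must be REPEATED")
    if middle.name != "key_value":
        problems.append(
            f"middle level is named {middle.name!r}, not 'key_value'"
        )
    return problems


# The "params" column as printed by fastparquet above.
params = Node("params", "OPTIONAL", [
    Node("map", "REPEATED", [
        Node("key", "REQUIRED"),
        Node("value", "OPTIONAL"),
    ]),
])

problems = check_map_middle_level(params)
print(problems)
```

For this file the only deviation the sketch reports is the name of the middle level; the repetition is `REPEATED` as required.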
But I assume the issue is that this middle level is annotated with the
"UNKNOWN" logical type. And in our code, we don't allow a group node to have
such a logical type (to my understanding, this annotation also doesn't make
sense here, as it should be used to indicate that all values in that column are
null).
Looking at the parquet-mr repo, it seems this was fixed in version 1.12
(https://github.com/apache/parquet-mr/pull/798 /
https://issues.apache.org/jira/browse/PARQUET-1879). So if you update your
Spark (and parquet-mr) version and write the file again, then I assume it will
be readable by Arrow.
Now, this still means that we cannot read files written by parquet-mr <
1.12. I assume also on the Arrow side it should be possible to add some
workaround to ignore the UNKNOWN logical type of a group node _if_ the
converted type is MAP_KEY_VALUE.
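
The decision logic of such a workaround could look roughly like the following. This is a pure-Python sketch of the idea only, not Arrow's actual C++ code; `SchemaNode`, `effective_logical_type` and `validate` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SchemaNode:
    # Hypothetical, simplified view of a Parquet schema element.
    name: str
    is_group: bool
    logical_type: Optional[str]    # e.g. "UNKNOWN", "MAP", "STRING" or None
    converted_type: Optional[str]  # e.g. "MAP_KEY_VALUE", "UTF8" or None


def effective_logical_type(node: SchemaNode) -> Optional[str]:
    """Logical type to validate against, tolerating the old writer quirk."""
    if (
        node.is_group
        and node.logical_type == "UNKNOWN"
        and node.converted_type == "MAP_KEY_VALUE"
    ):
        # Assumed to come from parquet-mr < 1.12: ignore the annotation.
        return None
    return node.logical_type


def validate(node: SchemaNode) -> None:
    """Reject group nodes that are still annotated as UNKNOWN."""
    if node.is_group and effective_logical_type(node) == "UNKNOWN":
        raise ValueError(
            f"group node {node.name!r} cannot be annotated UNKNOWN"
        )


# The middle level of "params" from the schema above: accepted.
legacy_middle = SchemaNode("map", True, "UNKNOWN", "MAP_KEY_VALUE")
validate(legacy_middle)

# A group annotated UNKNOWN without the MAP_KEY_VALUE converted type
# would still be rejected by validate().
bad = SchemaNode("other", True, "UNKNOWN", None)
```

Keying the exception on the converted type keeps the check strict for every other group node, so only the specific legacy pattern is let through.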