[
https://issues.apache.org/jira/browse/FLINK-35641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jingsong Lee closed FLINK-35641.
--------------------------------
Fix Version/s: 2.0.0
Resolution: Fixed
fixed in: a54311e89406c88e93b8e93d9ab484dc841bce0a
[~asorokoumov] I just merged this into master. Feel free to re-open this Jira
if you want to cherry-pick it to 1.x.
> ParquetSchemaConverter should correctly handle field optionality
> ----------------------------------------------------------------
>
> Key: FLINK-35641
> URL: https://issues.apache.org/jira/browse/FLINK-35641
> Project: Flink
> Issue Type: Bug
> Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
> Reporter: Alex Sorokoumov
> Assignee: Alex Sorokoumov
> Priority: Major
> Labels: patch-available, pull-request-available
> Fix For: 2.0.0
>
>
> At the moment,
> [ParquetSchemaConverter|https://github.com/apache/flink/blob/99d6fd3c68f46daf0397a35566414e1d19774c3d/flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/utils/ParquetSchemaConverter.java#L64]
> marks all fields as optional. This is incorrect in general and is especially
> problematic for maps. For example,
> [parquet-tools|https://pypi.org/project/parquet-tools/] breaks on the Parquet
> file produced by
> [ParquetRowDataWriterTest#complexTypeTest|https://github.com/apache/flink/blob/99d6fd3c68f46daf0397a35566414e1d19774c3d/flink-formats/flink-parquet/src/test/java/org/apache/flink/formats/parquet/row/ParquetRowDataWriterTest.java#L140-L151]:
> {noformat}
> parquet-tools inspect /var/folders/sc/k3hr87fj4x169rdq9n107whw0000gp/T/junit14646865447948471989/3b328592-7315-48c6-8fa9-38da4048fb4e
> Traceback (most recent call last):
>   File "/Users/asorokoumov/.pyenv/versions/3.12.3/bin/parquet-tools", line 8, in <module>
>     sys.exit(main())
>              ^^^^^^
>   File "/Users/asorokoumov/.pyenv/versions/3.12.3/lib/python3.12/site-packages/parquet_tools/cli.py", line 26, in main
>     args.handler(args)
>   File "/Users/asorokoumov/.pyenv/versions/3.12.3/lib/python3.12/site-packages/parquet_tools/commands/inspect.py", line 55, in _cli
>     _execute_simple(
>   File "/Users/asorokoumov/.pyenv/versions/3.12.3/lib/python3.12/site-packages/parquet_tools/commands/inspect.py", line 63, in _execute_simple
>     pq_file: pq.ParquetFile = pq.ParquetFile(filename)
>                               ^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/Users/asorokoumov/.pyenv/versions/3.12.3/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 317, in __init__
>     self.reader.open(
>   File "pyarrow/_parquet.pyx", line 1492, in pyarrow._parquet.ParquetReader.open
>   File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Map keys must be annotated as required.
> {noformat}
> [The correct thing to do|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps]
> is to mark nullable fields as optional and all other fields as required.
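
The rule in the description (nullable fields become optional, everything else required, and map keys always required) can be illustrated with a small sketch. This is not Flink's ParquetSchemaConverter and not the merged patch; the helpers `repetition`, `primitive`, and `map_group` are hypothetical names, and the output is only a textual rendering of the map layout described in the Parquet LogicalTypes document.

```python
def repetition(nullable: bool) -> str:
    """Nullable fields become optional; all other fields are required."""
    return "optional" if nullable else "required"


def primitive(name: str, parquet_type: str, nullable: bool) -> str:
    """Render one primitive field, e.g. 'required binary key;'."""
    return f"{repetition(nullable)} {parquet_type} {name};"


def map_group(name: str, key_type: str, value_type: str,
              map_nullable: bool, value_nullable: bool) -> str:
    """Render a MAP group (hypothetical helper, illustration only)."""
    lines = [
        f"{repetition(map_nullable)} group {name} (MAP) {{",
        "  repeated group key_value {",
        # Map keys are never nullable, so the key is always required.
        "    " + primitive("key", key_type, nullable=False),
        "    " + primitive("value", value_type, value_nullable),
        "  }",
        "}",
    ]
    return "\n".join(lines)


# A nullable map with nullable values: only the key is forced to required.
schema = map_group("tags", "binary", "int32",
                   map_nullable=True, value_nullable=True)
print(schema)
```

With a converter that marks every field as optional, the key line would instead read `optional binary key;`, which is the shape pyarrow rejects in the traceback above.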