[
https://issues.apache.org/jira/browse/ARROW-15899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583529#comment-17583529
]
Rácz Dániel commented on ARROW-15899:
-------------------------------------
Hi, is there any chance that this bug will get fixed anytime soon?
> [C++] Parquet writes broken file or incorrect data when nullable=False
> ----------------------------------------------------------------------
>
> Key: ARROW-15899
> URL: https://issues.apache.org/jira/browse/ARROW-15899
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Parquet
> Affects Versions: 6.0.0, 6.0.1, 7.0.0, 7.0.2
> Reporter: Rácz Dániel
> Priority: Major
>
> When writing a pyarrow table to Parquet with an explicit schema, if that schema contains a field with `nullable=False` but the column actually holds a null value, the resulting Parquet file either
> * cannot be read back at all, or
> * has its columns silently shifted, leaving the whole table inconsistent. The null value is seemingly dropped and the remaining values are packed together according to the provided row_group_size (wrapping around to the start of the column when the writer runs out of values), so different row group sizes lead to different results. This off-by-one shift persists within a single row group; the next row group can be perfectly fine if it contains no null values.
>
> I believe none of these behaviours is intentional, but they are easily overlooked by the user, since one would expect providing a schema with constraints to produce at least a warning or, better, an exception when writing the file. The validation methods pyarrow provides also report no problem with such a table.
> You can find a snippet below demonstrating this behaviour.
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> field_name = 'a_string'
> schema = pa.schema([
>     pa.field(name=field_name, type=pa.string(), nullable=False)  # not nullable
> ])
> # Arrow Columnar Format doesn't care if a non-nullable field holds a null
> t_out = pa.table([['0', '1', None, '3', '4']], schema=schema) # OK
> t_out.validate(full=True) # OK
> t_out.cast(schema, safe=True) # OK
> # Parquet writing does not raise, but silently drops the null string
> # because the field is REQUIRED in the Parquet schema.
> # Then you either cannot read the parquet back, or the returned data
> # is fabricated, depending on the written row_group_size.
> pq.write_table(t_out, where='pq_1', row_group_size=1)
> pq.read_table('pq_1')
> # -> OSError: Unexpected end of stream
> pq.write_table(t_out, where='pq_2', row_group_size=2)
> pq.read_table('pq_2')
> # -> OSError: Unexpected end of stream
> # -> or sometimes: pyarrow.lib.ArrowInvalid: Index not in dictionary bounds
> pq.write_table(t_out, where='pq_3', row_group_size=3)
> print(pq.read_table('pq_3')[field_name])
> # -> [["0","1","0"],["3","4"]]
> pq.write_table(t_out, where='pq_4', row_group_size=4)
> print(pq.read_table('pq_4')[field_name])
> # -> [["0","1","3","0"],["4"]]
> pq.write_table(t_out, where='pq_5', row_group_size=5)
> print(pq.read_table('pq_5')[field_name])
> # -> [["0","1","3","4","0"]]{code}