[ https://issues.apache.org/jira/browse/ARROW-15899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583529#comment-17583529 ]

Rácz Dániel commented on ARROW-15899:
-------------------------------------

Hi, is there any chance that this bug will get fixed anytime soon?

> [C++] Parquet writes broken file or incorrect data when nullable=False
> ----------------------------------------------------------------------
>
>                 Key: ARROW-15899
>                 URL: https://issues.apache.org/jira/browse/ARROW-15899
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Parquet
>    Affects Versions: 6.0.0, 6.0.1, 7.0.0, 7.0.2
>            Reporter: Rácz Dániel
>            Priority: Major
>
> When writing a pyarrow table to Parquet with a provided schema, if that 
> schema contains a field with `nullable=false` but the corresponding column 
> holds an actual null value, the resulting Parquet file either
>  * cannot be read back, or
>  * has its values silently shifted, leaving the whole table inconsistent. 
> The null value is seemingly dropped and the remaining values are packed 
> together within each row group defined by row_group_size (wrapping around to 
> the start of the column when the writer runs out of values), so different 
> row group sizes produce different results. This off-by-one corruption is 
> confined to a single row group; the next one can be perfectly fine if it 
> contains no null values.
>  
> I believe none of these behaviours is intentional, but they are easy for a 
> user to overlook: one would expect that providing a schema with constraints 
> leads to at least a warning or, better, an exception when writing the file. 
> The provided validation methods (validate(full=True) and a safe cast) also 
> report no problem with this particular case.
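> As a workaround until a fix lands, the check can be done manually before 
> writing. This is my own sketch (the helper name is made up); it relies only 
> on Field.nullable and ChunkedArray.null_count:
> {code:python}
> import pyarrow as pa
> 
> def assert_required_fields_have_no_nulls(table: pa.Table) -> None:
>     # A REQUIRED (non-nullable) Parquet field must not contain nulls;
>     # raise before pq.write_table gets a chance to corrupt the file.
>     for field in table.schema:
>         nulls = table.column(field.name).null_count
>         if not field.nullable and nulls > 0:
>             raise ValueError(
>                 f"non-nullable field {field.name!r} contains {nulls} null(s)")
> {code}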
> You can find a snippet below reproducing this behaviour.
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> field_name = 'a_string'
> schema = pa.schema([
>     pa.field(name=field_name, type=pa.string(), nullable=False) # not nullable
> ])
> # Arrow Columnar Format doesn't care if a non-nullable field holds a null
> t_out = pa.table([['0', '1', None, '3', '4']], schema=schema) # OK
> t_out.validate(full=True) # OK
> t_out.cast(schema, safe=True) # OK
> # Parquet writing does not raise, but silently kills the null string
> # because of the REQUIRED-ness of the field in the schema.
> # Then you either cannot read the parquet back, or the returned data
> # is invented, depending on the written row_group_size.
> pq.write_table(t_out, where='pq_1', row_group_size=1)
> pq.read_table('pq_1')
> # -> OSError: Unexpected end of stream
> pq.write_table(t_out, where='pq_2', row_group_size=2)
> pq.read_table('pq_2')
> # -> OSError: Unexpected end of stream
> # -> or sometimes: pyarrow.lib.ArrowInvalid: Index not in dictionary bounds
> pq.write_table(t_out, where='pq_3', row_group_size=3)
> print(pq.read_table('pq_3')[field_name])
> # -> [["0","1","0"],["3","4"]]
> pq.write_table(t_out, where='pq_4', row_group_size=4)
> print(pq.read_table('pq_4')[field_name])
> # -> [["0","1","3","0"],["4"]]
> pq.write_table(t_out, where='pq_5', row_group_size=5)
> print(pq.read_table('pq_5')[field_name])
> # -> [["0","1","3","4","0"]]{code}
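> A possible mitigation on the writing side (again my own sketch, not a 
> documented workaround) is to relax the field's nullability before writing, 
> so the Parquet column is emitted as OPTIONAL and the null survives the 
> round trip:
> {code:python}
> # Assumed mitigation: cast to a nullable variant of the same schema.
> nullable_schema = pa.schema([
>     pa.field(name=field_name, type=pa.string(), nullable=True)
> ])
> pq.write_table(t_out.cast(nullable_schema), where='pq_nullable')
> print(pq.read_table('pq_nullable')[field_name])
> # expected: [["0","1",null,"3","4"]]
> {code}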



