Rácz Dániel created ARROW-15899:
-----------------------------------

             Summary: [Parquet] Writes broken file or incorrect data when nullable=False
                 Key: ARROW-15899
                 URL: https://issues.apache.org/jira/browse/ARROW-15899
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 7.0.0, 7.0.1
            Reporter: Rácz Dániel


When writing a pyarrow table to Parquet with a provided schema, where the schema
marks a field as `nullable=False` but the column actually contains a null value,
the resulting Parquet file either
 * cannot be read back at all, or
 * comes back with the columns `pushed up` and the whole table inconsistent.
The null value is silently dropped and the remaining values are packed together
per row group (wrapping back to the start of the data when a row group runs out
of values), so different row_group_size settings produce different results. The
off-by-one shift persists within a single row group; the next row group can be
perfectly fine if it contains no null values.

 

I don't believe either of these behaviours is intentional, but they are easy for
the user to overlook: one would expect that providing a schema with such a
constraint leads to at least a warning (or, better, an exception) when writing
the file. The validation methods pyarrow provides also report no problem with
this particular table.
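
For completeness, a minimal user-side guard (just a sketch on my part, rebuilding
the same table as in the snippet below) that would catch the bad data before it
reaches the Parquet writer:

import pyarrow as pa

field_name = 'a_string'
schema = pa.schema([
    pa.field(name=field_name, type=pa.string(), nullable=False)
])
table = pa.table([['0', '1', None, '3', '4']], schema=schema)

# None of the built-in checks flag the null, so scan each non-nullable
# column by hand before handing the table to the Parquet writer.
for field in table.schema:
    if not field.nullable and table[field.name].null_count > 0:
        raise ValueError(f"non-nullable field {field.name!r} contains nulls")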




You can find a snippet below explaining this weird behaviour.


import pyarrow as pa
import pyarrow.parquet as pq

field_name = 'a_string'
schema = pa.schema([
    pa.field(name=field_name, type=pa.string(), nullable=False) # not nullable
])

# Arrow Columnar Format doesn't care if a non-nullable field holds a null
t_out = pa.table([['0', '1', None, '3', '4']], schema=schema) # OK
t_out.validate(full=True) # OK
t_out.cast(schema, safe=True) # OK


# Parquet writing does not raise, but silently kills the null string
# because of the REQUIRED-ness of the field in the schema.

# Then you either cannot read the parquet back, or the returned data
# is invented, depending on the written row_group_size.

pq.write_table(t_out, where='pq_1', row_group_size=1)
pq.read_table('pq_1')
# -> OSError: Unexpected end of stream

pq.write_table(t_out, where='pq_2', row_group_size=2)
pq.read_table('pq_2')
# -> OSError: Unexpected end of stream
# -> or sometimes: pyarrow.lib.ArrowInvalid: Index not in dictionary bounds

pq.write_table(t_out, where='pq_3', row_group_size=3)
print(pq.read_table('pq_3')[field_name])
# -> [["0","1","0"],["3","4"]]

pq.write_table(t_out, where='pq_4', row_group_size=4)
print(pq.read_table('pq_4')[field_name])
# -> [["0","1","3","0"],["4"]]

pq.write_table(t_out, where='pq_5', row_group_size=5)
print(pq.read_table('pq_5')[field_name])
# -> [["0","1","3","4","0"]]


