lidavidm commented on pull request #12706:
URL: https://github.com/apache/arrow/pull/12706#issuecomment-1086345959


   So the Parquet tests are now failing; for example: https://github.com/apache/arrow/blob/623a15e7f7a45578733956714c8dddcc9f66f015/cpp/src/parquet/arrow/arrow_reader_writer_test.cc#L2816-L2830
   
   It seems we (understandably) ignore the child array slots corresponding to top-level nulls, either when writing or when reading Parquet, but of course if the child field is marked non-nullable, this becomes a problem now that we're validating that fact.
   
   Example in Python:
   
   ```python
   >>> import pyarrow as pa
   >>> import pyarrow.parquet as pq
   >>> ints = pa.array([1,2,3,4,5,6,7,8])
   >>> struct = pa.StructArray.from_arrays([ints], fields=[pa.field("ints", pa.int64(), nullable=False)], mask=pa.array([True, True, False, False, True, True, True, True]))
   >>> struct
   <pyarrow.lib.StructArray object at 0x7fe5bc16a440>
   -- is_valid:
     [
       false,
       false,
       true,
       true,
       false,
       false,
       false,
       false
     ]
   -- child 0 type: int64
     [
       1,
       2,
       3,
       4,
       5,
       6,
       7,
       8
     ]
   >>> table = pa.table([struct], names=["struct"])
   >>> pq.write_table(table, "test.parquet")
   >>> pq.read_table("test.parquet")
   pyarrow.Table
   struct: struct<ints: int64 not null>
     child 0, ints: int64 not null
   ----
   struct: [
     -- is_valid:
       [
         false,
         false,
         true,
         true,
         false,
         false,
         false,
         false
       ]
     -- child 0 type: int64
       [
         null,
         null,
         3,
         4,
         null,
         null,
         null,
         null
       ]]
   >>> table
   pyarrow.Table
   struct: struct<ints: int64 not null>
     child 0, ints: int64 not null
   ----
   struct: [
     -- is_valid:
       [
         false,
         false,
         true,
         true,
         false,
         false,
         false,
         false
       ]
     -- child 0 type: int64
       [
         1,
         2,
         3,
         4,
         5,
         6,
         7,
         8
       ]]
   ```
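
   To make the mismatch concrete, a quick check (just a sketch; `readback` below is the round-tripped struct column from the example above) shows a child field declared non-nullable whose array now contains nulls:
   
   ```python
   >>> readback = pq.read_table("test.parquet")["struct"].combine_chunks()
   >>> readback.type.field("ints").nullable  # child field is declared non-nullable...
   False
   >>> readback.field("ints").null_count  # ...yet it now has nulls behind the null parents
   6
   ```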
   
   It _seems_ this happens on write, looking at the underlying column's statistics:
   
   ```python
   >>> pq.read_metadata("test.parquet").row_group(0).column(0).statistics
   <pyarrow._parquet.Statistics object at 0x7fe5bc1d74c0>
     has_min_max: True
     min: 3
     max: 4
     null_count: 6
     distinct_count: 0
     num_values: 2
     physical_type: INT64
     logical_type: None
     converted_type (legacy): NONE
   ```
   
   @pitrou @emkornfield am I understanding this right? The statistics report num_values: 2 and null_count: 6, so the writer never emitted the child values behind the null struct slots. It seems we either need to write out the values of child arrays behind null slots, or fill in some value on read?
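
   For the "fill in some value on read" option, a rough sketch of what that could look like at the Python level (the fill value `0` is an arbitrary placeholder, `readback` is the round-tripped struct array from above, and `pc.is_null` just rebuilds the top-level validity mask):
   
   ```python
   >>> import pyarrow.compute as pc
   >>> repaired = pa.StructArray.from_arrays(
   ...     [pc.fill_null(readback.field("ints"), 0)],   # arbitrary placeholder value
   ...     fields=[pa.field("ints", pa.int64(), nullable=False)],
   ...     mask=pc.is_null(readback),                   # preserve the top-level nulls
   ... )
   >>> repaired.field("ints").null_count
   0
   ```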
   
   Just for fun, Feather appears to handle this correctly:
   
   ```python
   >>> import pyarrow as pa
   >>> import pyarrow.feather as pf
   >>> ints = pa.array([1,2,3,4,5,6,7,8])
   >>> struct = pa.StructArray.from_arrays([ints], fields=[pa.field("ints", pa.int64(), nullable=False)], mask=pa.array([True, True, False, False, True, True, True, True]))
   >>> table = pa.table([struct], names=["struct"])
   >>> pf.write_feather(table, "test.feather")
   >>> pf.read_table("test.feather")
   pyarrow.Table
   struct: struct<ints: int64 not null>
     child 0, ints: int64 not null
   ----
   struct: [
     -- is_valid:
       [
         false,
         false,
         true,
         true,
         false,
         false,
         false,
         false
       ]
     -- child 0 type: int64
       [
         1,
         2,
         3,
         4,
         5,
         6,
         7,
         8
       ]]
   ```
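
   A quick check against the original `ints` array confirms the child values survive the Feather round trip untouched:
   
   ```python
   >>> pf.read_table("test.feather")["struct"].combine_chunks().field("ints").equals(ints)
   True
   ```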

