lidavidm commented on pull request #12706: URL: https://github.com/apache/arrow/pull/12706#issuecomment-1086345959
So the Parquet tests are now failing, example: https://github.com/apache/arrow/blob/623a15e7f7a45578733956714c8dddcc9f66f015/cpp/src/parquet/arrow/arrow_reader_writer_test.cc#L2816-L2830 It seems we (understandably) ignore the child array slots corresponding to top-level nulls, either on writing or on reading Parquet, but of course if the child field is marked non-nullable this becomes a problem now that we're validating this fact. Example in Python: ```python >>> import pyarrow as pa >>> import pyarrow.parquet as pq >>> ints = pa.array([1,2,3,4,5,6,7,8]) >>> struct = pa.StructArray.from_arrays([ints], fields=[pa.field("ints", pa.int64(), nullable=False)], mask=pa.array([True, True, False, False, True, True, True, True])) >>> struct <pyarrow.lib.StructArray object at 0x7fe5bc16a440> -- is_valid: [ false, false, true, true, false, false, false, false ] -- child 0 type: int64 [ 1, 2, 3, 4, 5, 6, 7, 8 ] >>> table = pa.table([struct], names=["struct"]) >>> pq.write_table(table, "test.parquet") >>> pq.read_table("test.parquet") pyarrow.Table struct: struct<ints: int64 not null> child 0, ints: int64 not null ---- struct: [ -- is_valid: [ false, false, true, true, false, false, false, false ] -- child 0 type: int64 [ null, null, 3, 4, null, null, null, null ]] >>> table pyarrow.Table struct: struct<ints: int64 not null> child 0, ints: int64 not null ---- struct: [ -- is_valid: [ false, false, true, true, false, false, false, false ] -- child 0 type: int64 [ 1, 2, 3, 4, 5, 6, 7, 8 ]] ``` It _seems_ this is done on write, looking at the underlying column: ```python >>> pq.read_metadata("test.parquet").row_group(0).column(0).statistics <pyarrow._parquet.Statistics object at 0x7fe5bc1d74c0> has_min_max: True min: 3 max: 4 null_count: 6 distinct_count: 0 num_values: 2 physical_type: INT64 logical_type: None converted_type (legacy): NONE ``` @pitrou @emkornfield am I understanding this right? It seems we either need to write out the values of child arrays behind null slots, or fill in some value on read? Just for fun, Feather appears to handle this correctly: ```python >>> import pyarrow as pa >>> import pyarrow.feather as pf >>> ints = pa.array([1,2,3,4,5,6,7,8]) >>> struct = pa.StructArray.from_arrays([ints], fields=[pa.field("ints", pa.int64(), nullable=False)], mask=pa.array([True, True, False, False, True, True, True, True])) >>> table = pa.table([struct], names=["struct"]) >>> pf.write_feather(table, "test.feather") >>> pf.read_table("test.feather") pyarrow.Table struct: struct<ints: int64 not null> child 0, ints: int64 not null ---- struct: [ -- is_valid: [ false, false, true, true, false, false, false, false ] -- child 0 type: int64 [ 1, 2, 3, 4, 5, 6, 7, 8 ]] ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
