jorisvandenbossche commented on issue #35730:
URL: https://github.com/apache/arrow/issues/35730#issuecomment-1562399718

   @weston note that this is not (AFAIU) about custom metadata, but just about 
how the arrow schema gets translated to a Parquet schema (or how the arrow 
schema gets changed throughout dataset writing).
   
   If we write a single file (directly using the Parquet file writer, not going 
through datasets), then a pyarrow field with nullable=False gets translated 
into a "required" parquet field:
   
   ```python
   >>> import pyarrow as pa
   >>> import pyarrow.parquet as pq
   >>> schema = pa.schema([pa.field("col1", "int64", nullable=True), pa.field("col2", "int64", nullable=False)])
   >>> table = pa.table({"col1": [1, 2, 3], "col2": [2, 3, 4]}, schema=schema)
   >>> table.schema
   col1: int64
   col2: int64 not null
   
   >>> pq.write_table(table, "test_nullability.parquet")
   >>> pq.read_metadata("test_nullability.parquet").schema
   <pyarrow._parquet.ParquetSchema object at 0x7f21957c9700>
   required group field_id=-1 schema {
     optional int64 field_id=-1 col1;
     required int64 field_id=-1 col2;       # <--- this is "required" instead of "optional"
   }
   ```
   
   But if we write the same table as a single file (in a directory) through the dataset
API (so without even using a partitioning column), the non-nullable column is no
longer "required" in the Parquet schema:
   
   ```python
   >>> import pyarrow.dataset as ds
   >>> ds.write_dataset(table, "test_dataset_nullability/", format="parquet")
   >>> pq.read_metadata("test_dataset_nullability/part-0.parquet").schema
   <pyarrow._parquet.ParquetSchema object at 0x7f219d16cfc0>
   required group field_id=-1 schema {
     optional int64 field_id=-1 col1;
     optional int64 field_id=-1 col2;        # <--- no longer "required" !
   }
   ```
   
   So I suppose that somewhere in the dataset writing code path, the schema
loses the field nullability information.
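   
   For reference, this can also be checked at the Arrow level after reading the files back (a quick sketch reusing the file names from the snippets above; note that `pq.read_schema` reconstructs the Arrow schema from the file, so the result also depends on the serialized Arrow schema stored in the file's metadata):
   
   ```python
   import pyarrow.parquet as pq
   
   # Compare the nullability flag of "col2" as Arrow sees it when reading
   # back the two files written above.
   direct = pq.read_schema("test_nullability.parquet")
   through_dataset = pq.read_schema("test_dataset_nullability/part-0.parquet")
   
   print(direct.field("col2").nullable)           # False -> "required" was preserved
   print(through_dataset.field("col2").nullable)  # True would confirm the flag got dropped
   ```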
   
   > The behavior changed sometime between arrow 7 and 12 since it used to work 
with arrow 7
   
   I suppose this is because `pq.write_to_dataset` now uses 
`pyarrow.dataset.write_dataset` under the hood, i.e. it goes through the dataset 
API, while the "legacy" implementation of `pq.write_to_dataset` used a custom 
implementation based on the direct Parquet file writer (so it comes down to the 
difference between those two paths, as illustrated above).
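   
   Until that is resolved, a possible workaround (just a sketch of the idea, not what the legacy `pq.write_to_dataset` literally did; the `part` column handling and the helper name are made up for illustration) is to split the table yourself and write each piece with the direct Parquet writer, which preserves the nullability:
   
   ```python
   import os
   import pyarrow as pa
   import pyarrow.compute as pc
   import pyarrow.parquet as pq
   
   def write_partitioned(table: pa.Table, base_dir: str, part_col: str) -> None:
       # Write one file per distinct value of `part_col`, using the direct
       # Parquet writer so that nullable=False fields stay "required".
       for value in table.column(part_col).unique():
           mask = pc.equal(table[part_col], value)
           subset = table.filter(mask).drop_columns([part_col])
           out_dir = os.path.join(base_dir, f"{part_col}={value.as_py()}")
           os.makedirs(out_dir, exist_ok=True)
           pq.write_table(subset, os.path.join(out_dir, "part-0.parquet"))
   ```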

