jorisvandenbossche commented on issue #35730:
URL: https://github.com/apache/arrow/issues/35730#issuecomment-1562399718
@weston note that this is not (AFAIU) about custom metadata, but just about
how the arrow schema gets translated to a Parquet schema (or how the arrow
schema gets changed throughout dataset writing).
If we write a single file (directly using the Parquet file writer, not going
through datasets), then a pyarrow field with nullable=False gets translated
into a "required" parquet field:
```python
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> schema = pa.schema([pa.field("col1", "int64", nullable=True),
...                     pa.field("col2", "int64", nullable=False)])
>>> table = pa.table({"col1": [1, 2, 3], "col2": [2, 3, 4]}, schema=schema)
>>> table.schema
col1: int64
col2: int64 not null
>>> pq.write_table(table, "test_nullability.parquet")
>>> pq.read_metadata("test_nullability.parquet").schema
<pyarrow._parquet.ParquetSchema object at 0x7f21957c9700>
required group field_id=-1 schema {
  optional int64 field_id=-1 col1;
  required int64 field_id=-1 col2;  # <--- this is "required" instead of "optional"
}
```
But if we write this as a single file (in a directory) through the dataset
API (so not even using a partitioning column), the non-nullable column is no
longer "required" in the parquet field:
```python
>>> import pyarrow.dataset as ds
>>> ds.write_dataset(table, "test_dataset_nullability/", format="parquet")
>>> pq.read_metadata("test_dataset_nullability/part-0.parquet").schema
<pyarrow._parquet.ParquetSchema object at 0x7f219d16cfc0>
required group field_id=-1 schema {
  optional int64 field_id=-1 col1;
  optional int64 field_id=-1 col2;  # <--- no longer "required"!
}
```
So I suppose that somewhere in the dataset writing code path, the schema
loses the field nullability information.
> The behavior changed sometime between arrow 7 and 12 since it used to work
with arrow 7
I suppose this is because `pq.write_to_dataset` now uses
`pyarrow.dataset.write_dataset` under the hood, i.e. it goes through the
dataset API, while the "legacy" implementation of `pq.write_to_dataset` was a
custom implementation built on the direct Parquet file writer. The change in
behavior then comes down to the difference between those two code paths, as
illustrated above.