jorisvandenbossche commented on issue #35730:
URL: https://github.com/apache/arrow/issues/35730#issuecomment-1561372323
@ildipo Thanks for the report!
The Parquet format doesn't have such a flag directly, but it encodes nulls through definition levels, and a field can be marked as "required" in the schema. It seems that when writing individual tables to Parquet files we translate "not null" into required Parquet types, and when reading we convert a required field back to "not null".
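The reporter's `table` isn't shown, so the snippets below assume a hypothetical reconstruction that matches the schemas in the output:
```python
import datetime

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Hypothetical reconstruction of the reporter's `table`, matching the
# schemas shown in the output below (x is marked "not null")
table = pa.table(
    {
        "x": [1, 2],
        "y": [3, None],
        "date": [datetime.date(2023, 5, 1), datetime.date(2023, 5, 2)],
    },
    schema=pa.schema(
        [
            pa.field("x", pa.int64(), nullable=False),
            pa.field("y", pa.int64()),
            pa.field("date", pa.date32()),
        ]
    ),
)
```
With a table like that, the single-file roundtrip preserves the flag: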
```python
>>> pq.write_table(table, "test_nullability.parquet")
>>> pq.read_metadata("test_nullability.parquet").schema
<pyarrow._parquet.ParquetSchema object at 0x7f21b778fec0>
required group field_id=-1 schema {
  required int64 field_id=-1 x;
  optional int64 field_id=-1 y;
  optional int32 field_id=-1 date (Date);
}
>>> pq.read_table("test_nullability.parquet").schema
x: int64 not null
y: int64
date: date32[day]
```
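A quick sanity check on that roundtrip (using the reconstructed table above):
```python
# Verify the nullability flag survives the single-file roundtrip
roundtripped = pq.read_table("test_nullability.parquet")
assert not roundtripped.schema.field("x").nullable
assert roundtripped.schema.field("y").nullable
```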
So this is supported by the Parquet module itself, which means it must be the dataset API that loses this information somewhere. My quick guess is that it is related to partitioning:
```python
>>> pq.write_to_dataset(table, "test_dataset_nullability")
# reading directory -> lost "not null"
>>> ds.dataset("test_dataset_nullability/", format="parquet").schema
x: int64
y: int64
date: date32[day]
# reading single file -> preserved "not null"
>>> ds.dataset("test_nullability.parquet", format="parquet").schema
x: int64 not null
y: int64
date: date32[day]
```
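As a possible workaround in the meantime (a sketch on my side, not verified): `ds.dataset()` accepts an explicit `schema`, so passing the original schema should carry the "not null" flag through instead of relying on schema inference:
```python
# Pass the expected schema explicitly instead of relying on inference;
# assumption: the fields then keep nullable=False from the original table
dataset = ds.dataset(
    "test_dataset_nullability/",
    format="parquet",
    schema=table.schema,
)
print(dataset.schema)
# expected:
# x: int64 not null
# y: int64
# date: date32[day]
```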