jorisvandenbossche commented on issue #35730:
URL: https://github.com/apache/arrow/issues/35730#issuecomment-1561372323
@ildipo Thanks for the report!
The Parquet format doesn't have such a flag directly, but it encodes nulls through definition levels, and a field can be marked as "required" in the schema. It seems that when writing individual tables to Parquet files we translate "not null" into required Parquet types, and when reading we convert a required field back to "not null".
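The reporter's `table` isn't shown, so the snippets below assume a hypothetical reconstruction that matches the schemas in the output:
```python
import datetime

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Hypothetical reconstruction of the reporter's `table`, matching the
# schemas shown in the output below (x is marked "not null")
table = pa.table(
    {
        "x": [1, 2],
        "y": [3, None],
        "date": [datetime.date(2023, 5, 1), datetime.date(2023, 5, 2)],
    },
    schema=pa.schema(
        [
            pa.field("x", pa.int64(), nullable=False),
            pa.field("y", pa.int64()),
            pa.field("date", pa.date32()),
        ]
    ),
)
```
With a table like that, the single-file roundtrip preserves the flag: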
```python
>>> pq.write_table(table, "test_nullability.parquet")
>>> pq.read_metadata("test_nullability.parquet").schema
<pyarrow._parquet.ParquetSchema object at 0x7f21b778fec0>
required group field_id=-1 schema {
  required int64 field_id=-1 x;
  optional int64 field_id=-1 y;
  optional int32 field_id=-1 date (Date);
}
>>> pq.read_table("test_nullability.parquet").schema
x: int64 not null
y: int64
date: date32[day]
```
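A quick sanity check on that roundtrip (using the reconstructed table above):
```python
# Verify the nullability flag survives the single-file roundtrip
roundtripped = pq.read_table("test_nullability.parquet")
assert not roundtripped.schema.field("x").nullable
assert roundtripped.schema.field("y").nullable
```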
So this is supported by the Parquet module itself, which means it must be the dataset API that loses this information somewhere. My quick guess is that it is related to partitioning:
```python
>>> pq.write_to_dataset(table, "test_dataset_nullability")
# reading directory -> lost "not null"
>>> ds.dataset("test_dataset_nullability/", format="parquet").schema
x: int64
y: int64
date: date32[day]
# reading single file -> preserved "not null"
>>> ds.dataset("test_nullability.parquet", format="parquet").schema
x: int64 not null
y: int64
date: date32[day]
```
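As a possible workaround in the meantime (a sketch on my side, not verified): `ds.dataset()` accepts an explicit `schema`, so passing the original schema should carry the "not null" flag through instead of relying on schema inference:
```python
# Pass the expected schema explicitly instead of relying on inference;
# assumption: the fields then keep nullable=False from the original table
dataset = ds.dataset(
    "test_dataset_nullability/",
    format="parquet",
    schema=table.schema,
)
print(dataset.schema)
# expected:
# x: int64 not null
# y: int64
# date: date32[day]
```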