[GitHub] [arrow] jorisvandenbossche commented on issue #35730: [Python] write_dataset does not preserve non-nullable columns in schema

via GitHub Thu, 25 May 2023 00:53:09 -0700


jorisvandenbossche commented on issue #35730:
URL: https://github.com/apache/arrow/issues/35730#issuecomment-1562446055


   Digging a bit further, the nullable flag of the field is lost in Acero's 
ProjectNode (the `FileSystemDataset::Write` call is essentially a combination 
of source+project+filter+write nodes). 
   
   Small reproducer in Python:
   
   ```python
   import pyarrow as pa
   from pyarrow.acero import (
       Declaration, TableSourceNodeOptions, ProjectNodeOptions, field
   )

   # col1 is nullable, col2 is explicitly non-nullable
   schema = pa.schema([
       pa.field("col1", "int64", nullable=True),
       pa.field("col2", "int64", nullable=False),
   ])
   table = pa.table({"col1": [1, 2, 3], "col2": [2, 3, 4]}, schema=schema)

   # source -> project plan that selects both columns unchanged
   table_source = Declaration("table_source", options=TableSourceNodeOptions(table))
   project = Declaration("project", ProjectNodeOptions([field("col1"), field("col2")]))
   decl = Declaration.from_sequence([table_source, project])

   print(table.schema)
   # col1: int64
   # col2: int64 not null
   print(decl.to_table().schema)
   # col1: int64
   # col2: int64
   ```
   
   This happens because the ProjectNode naively recreates the schema from the 
names/exprs, ignoring the field information of the original input schema:
   
   
https://github.com/apache/arrow/blob/6bd31f37ae66bd35594b077cb2f830be57e08acd/cpp/src/arrow/acero/project_node.cc#L64-L75
   
   So this only preserves the types of the original input schema, and drops 
any nullable flag or field-level metadata (the only special handling we do 
afterwards is to preserve the custom metadata of the schema as a whole).
   
   @westonpace rereading your original comment: while your explanation first 
focused on the schema metadata, you had essentially already said the 
above:
   
   > That being said, `custom_metadata` may not be sufficient here. It only 
allows you to specify the key/value metadata for the schema, and not individual 
field metadata.
   
   But as for what we should do about this: shouldn't the ProjectNode just try 
to preserve this information for trivial field ref expressions?
   


