jorisvandenbossche commented on issue #35730:
URL: https://github.com/apache/arrow/issues/35730#issuecomment-1562446055
Digging a bit further, this nullable field information is lost in Acero's
ProjectNode (the `FileSystemDataset::Write` call is essentially a combination
of source+project+filter+write nodes).
Small reproducer in Python:
```python
import pyarrow as pa
from pyarrow.acero import (
    Declaration, TableSourceNodeOptions, ProjectNodeOptions, field
)

# col1 is nullable, col2 is explicitly declared non-nullable
schema = pa.schema([pa.field("col1", "int64", nullable=True),
                    pa.field("col2", "int64", nullable=False)])
table = pa.table({"col1": [1, 2, 3], "col2": [2, 3, 4]}, schema=schema)

# source -> project plan that selects both columns unchanged
table_source = Declaration("table_source",
                           options=TableSourceNodeOptions(table))
project = Declaration("project", ProjectNodeOptions([field("col1"),
                                                     field("col2")]))
decl = Declaration.from_sequence([table_source, project])

>>> table.schema
col1: int64
col2: int64 not null
>>> decl.to_table().schema
col1: int64
col2: int64
```
This happens because the ProjectNode naively recreates the schema from the
names/exprs, ignoring the field information of the original input schema:
https://github.com/apache/arrow/blob/6bd31f37ae66bd35594b077cb2f830be57e08acd/cpp/src/arrow/acero/project_node.cc#L64-L75
So this only preserves the type of the original input schema, but will
ignore any nullable flag or field metadata information (and then we only have
some special-case code to preserve the custom metadata of the full schema).
@westonpace rereading your original comment, while your explanation first
focused on the schema metadata, you actually also already said essentially the
above:
> That being said, `custom_metadata` may not be sufficient here. It only
> allows you to specify the key/value metadata for the schema, and not individual
> field metadata.
But for what we need to do about this: shouldn't the ProjectNode just try to
preserve this information for trivial field ref expressions?
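
To make the suggestion concrete, here is a rough sketch of that rule as a toy pure-Python model. It is not the Acero implementation or any Arrow API: `Field`, `Expr` and `project_schema` are hypothetical names made up for illustration, and the real change would be in the C++ ProjectNode.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Field:
    # Hypothetical stand-in for an Arrow field: name, type, nullability, metadata.
    name: str
    type: str
    nullable: bool = True
    metadata: Optional[dict] = None


@dataclass
class Expr:
    # Hypothetical stand-in for a bound expression.
    # field_ref is set only when the expression is a bare column reference.
    type: str
    field_ref: Optional[str] = None


def project_schema(input_fields, names, exprs):
    """Build the output fields of a projection.

    For a trivial field reference, copy nullability and metadata from the
    referenced input field; for anything else, fall back to a default
    (nullable, no metadata) field carrying only the expression's type.
    """
    by_name = {f.name: f for f in input_fields}
    out = []
    for name, expr in zip(names, exprs):
        src = by_name.get(expr.field_ref) if expr.field_ref is not None else None
        if src is not None:
            out.append(Field(name, src.type, src.nullable, src.metadata))
        else:
            out.append(Field(name, expr.type))
    return out


input_fields = [Field("col1", "int64", nullable=True),
                Field("col2", "int64", nullable=False)]
result = project_schema(
    input_fields,
    ["col1", "col2", "total"],
    [Expr("int64", "col1"), Expr("int64", "col2"), Expr("int64")],
)
for f in result:
    print(f.name, f.type, "" if f.nullable else "not null")
```

In this model `col2` keeps its non-nullable flag because it is a plain reference, while the computed `total` column falls back to the default nullable field, which matches what the ProjectNode does for every column today.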