westonpace commented on issue #35730:
URL: https://github.com/apache/arrow/issues/35730#issuecomment-1561496846
Yes, write_dataset is a bit tricky when it comes to schema information. If
the input is multiple tables, then write_dataset will probably combine them
into a single output table, so which metadata do we use? What the write node
does today is allow a `custom_metadata` to be supplied alongside the dataset,
which it attaches to all written batches.
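For illustration, here's a minimal C++ sketch of supplying that metadata through `WriteNodeOptions` (assuming the current constructor; the helper name and key/value pair are just placeholders):
```
#include <memory>
#include <utility>

#include <arrow/dataset/file_base.h>
#include <arrow/util/key_value_metadata.h>

// Build write-node options that stamp schema-level key/value metadata
// onto every batch the write node emits.
arrow::dataset::WriteNodeOptions MakeWriteNodeOptions(
    arrow::dataset::FileSystemDatasetWriteOptions write_options) {
  // Placeholder metadata; pyarrow would put e.g. its "pandas" blob here.
  auto metadata = arrow::key_value_metadata({"my_key"}, {"my_value"});
  return arrow::dataset::WriteNodeOptions(std::move(write_options),
                                          std::move(metadata));
}
```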
On top of that, we have a bit of a hack in place today: if the input is a
single table, then preserve its metadata. This lives in
`FileSystemDataset::Write`, which is what pyarrow uses today:
```
// The projected_schema is currently used by pyarrow to preserve the
// custom metadata when reading from a single input file.
const auto& custom_metadata = scanner->options()->projected_schema->metadata();
```
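To make the flow concrete, here is a simplified sketch of what the writer can then do with that metadata (illustrative only, not the exact Arrow source; the function name is hypothetical):
```
#include <memory>

#include <arrow/record_batch.h>
#include <arrow/util/key_value_metadata.h>

// Re-attach the scanner's schema-level metadata to a batch before it is
// handed to the file writer, so it ends up in the written file's schema.
std::shared_ptr<arrow::RecordBatch> AttachCustomMetadata(
    const std::shared_ptr<arrow::RecordBatch>& batch,
    const std::shared_ptr<const arrow::KeyValueMetadata>& custom_metadata) {
  // Zero-copy: only the schema's metadata pointer changes.
  return batch->ReplaceSchemaMetadata(custom_metadata);
}
```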
This `custom_metadata` is not currently exposed to `pyarrow`. So I think we
probably want to add it.
That being said, `custom_metadata` may not be sufficient here: it only lets
you specify schema-level key/value metadata, not per-field metadata. So we'd
need to change that too. Putting it all together, we have (a rough sketch of
the proposed shape follows the list):
* Change `WriteNodeOptions::custom_metadata` to `WriteNodeOptions::schema`
* Do one of the following:
  * Add `custom_schema` to `FileSystemDataset::Write`
  * Change `pyarrow` to use Acero (and `WriteNodeOptions`) directly instead of `FileSystemDataset::Write`
* Add pyarrow bindings for whichever approach we did in the previous step
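For concreteness, one hypothetical shape the first bullet could take (not the current API; names and comments are illustrative, assuming the `arrow::dataset` namespace):
```
// Hypothetical sketch of the proposed WriteNodeOptions (not current API).
// Carrying a full schema instead of bare key/value metadata would let
// callers override field-level metadata as well as schema-level metadata.
class WriteNodeOptions : public acero::ExecNodeOptions {
 public:
  /// Options controlling where/how the files are written.
  FileSystemDatasetWriteOptions write_options;
  /// Schema (including schema- and field-level metadata) to stamp onto
  /// the written files in place of the plan's output schema.
  std::shared_ptr<Schema> custom_schema;
};
```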