westonpace commented on issue #35730:
URL: https://github.com/apache/arrow/issues/35730#issuecomment-1561496846
Yes, write_dataset is a bit tricky when it comes to schema information. If
the input is multiple tables, then write_dataset will probably combine them
into a single output table, so which metadata do we use? What the write node
does today is allow a `custom_metadata` to be supplied alongside the dataset,
which it attaches to all written batches.
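For illustration, here's a minimal C++ sketch of supplying that metadata through `WriteNodeOptions` (assuming the current constructor; the helper name and key/value pair are just placeholders):
```
#include <memory>
#include <utility>

#include <arrow/dataset/file_base.h>
#include <arrow/util/key_value_metadata.h>

// Build write-node options that stamp schema-level key/value metadata
// onto every batch the write node emits.
arrow::dataset::WriteNodeOptions MakeWriteNodeOptions(
    arrow::dataset::FileSystemDatasetWriteOptions write_options) {
  // Placeholder metadata; pyarrow would put e.g. its "pandas" blob here.
  auto metadata = arrow::key_value_metadata({"my_key"}, {"my_value"});
  return arrow::dataset::WriteNodeOptions(std::move(write_options),
                                          std::move(metadata));
}
```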
On top of that, we have a bit of a hack in place today: if the input is a
single table, then preserve its metadata. This lives in
`FileSystemDataset::Write`, which is what pyarrow uses today:
```
// The projected_schema is currently used by pyarrow to preserve the
// custom metadata when reading from a single input file.
const auto& custom_metadata = scanner->options()->projected_schema->metadata();
```
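To make the flow concrete, here is a simplified sketch of what the writer can then do with that metadata (illustrative only, not the exact Arrow source; the function name is hypothetical):
```
#include <memory>

#include <arrow/record_batch.h>
#include <arrow/util/key_value_metadata.h>

// Re-attach the scanner's schema-level metadata to a batch before it is
// handed to the file writer, so it ends up in the written file's schema.
std::shared_ptr<arrow::RecordBatch> AttachCustomMetadata(
    const std::shared_ptr<arrow::RecordBatch>& batch,
    const std::shared_ptr<const arrow::KeyValueMetadata>& custom_metadata) {
  // Zero-copy: only the schema's metadata pointer changes.
  return batch->ReplaceSchemaMetadata(custom_metadata);
}
```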
This `custom_metadata` is not currently exposed to `pyarrow`. So I think we
probably want to add it.
That being said, `custom_metadata` may not be sufficient here: it only lets
you specify schema-level key/value metadata, not per-field metadata. So we'd
need to change that too. Putting it all together, we have (a rough sketch of
the proposed shape follows the list):
* Change `WriteNodeOptions::custom_metadata` to `WriteNodeOptions::schema`
* Do one of the following:
  * Add `custom_schema` to `FileSystemDataset::Write`
  * Change `pyarrow` to use Acero (and `WriteNodeOptions`) directly instead of `FileSystemDataset::Write`
* Add pyarrow bindings for whichever approach we did in the previous step
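For concreteness, one hypothetical shape the first bullet could take (not the current API; names and comments are illustrative, assuming the `arrow::dataset` namespace):
```
// Hypothetical sketch of the proposed WriteNodeOptions (not current API).
// Carrying a full schema instead of bare key/value metadata would let
// callers override field-level metadata as well as schema-level metadata.
class WriteNodeOptions : public acero::ExecNodeOptions {
 public:
  /// Options controlling where/how the files are written.
  FileSystemDatasetWriteOptions write_options;
  /// Schema (including schema- and field-level metadata) to stamp onto
  /// the written files in place of the plan's output schema.
  std::shared_ptr<Schema> custom_schema;
};
```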