[GitHub] [arrow] westonpace commented on issue #35730: [Python] write_dataset does not preserve non-nullable columns in schema

via GitHub Fri, 26 May 2023 15:14:01 -0700


westonpace commented on issue #35730:
URL: https://github.com/apache/arrow/issues/35730#issuecomment-1565014569


   So here is the change that introduced this: 
https://github.com/apache/arrow/issues/31452
   
   Before the change we used to require that the schema be specified on the 
write node options.  This was an unnecessary burden when you didn't care about 
any custom field information (since we've already calculated the schema).
   
   > But for what we need to do about this: shouldn't the ProjectNode just try 
to preserve this information for trivial field ref expressions?
   
   I think there is still the problem that we largely ignore nullability.  We 
can't usually assume that all batches will have the same nullability.  For 
example, imagine a scan node where we are scanning two different parquet files. 
 One of the parquet files marks a column as nullable and the other does not.  I 
suppose the correct answer, if Acero were nullability-aware and once evolution 
is a little more robust, would be to "evolve" the schema of the file with a 
nullable type to a non-nullable type so that we have a common input schema.
   
   In the meantime, the quickest simple fix to this regression is to allow the 
user to specify an output schema instead of just key / value metadata.


