westonpace commented on issue #35730: URL: https://github.com/apache/arrow/issues/35730#issuecomment-1565014569
So here is the change that introduced this: https://github.com/apache/arrow/issues/31452

Before the change we required the schema to be specified on the write node options. This was an unnecessary burden when you didn't care about any custom field information (since we've already calculated the schema).

> But for what we need to do about this: shouldn't the ProjectNode just try to preserve this information for trivial field ref expressions?

I think there is still the problem that we largely ignore nullability. We can't usually assume that all batches will have the same nullability. For example, imagine a scan node where we are scanning two different Parquet files. One of the Parquet files marks a column as nullable and the other does not. I suppose the correct answer, if Acero were nullability-aware and once evolution is a little more robust, would be to "evolve" the schema of the file with a nullable type to a non-nullable type so that we have a common input schema.

In the meantime, the quickest simple fix to this regression is to allow the user to specify an output schema instead of just key / value metadata.
