rkrishn7 commented on PR #15242: URL: https://github.com/apache/datafusion/pull/15242#issuecomment-2746563234
@Omega359 Thanks for your thoughts! And yes, I agree that the problem isn't with `normalize::convert_batches`. It is just being surfaced there.

> I have a thought as to why it's happening. When you fill a column with null when that input is missing the column you are not preserving the other aspects of the column, specifically nullable. Thus, different batches (for the different inputs) can have differing nullable values.
>
> Input 1: 'zz' column missing, is added and set to null (thus nullable = true)
> Input 2: 'zz' column exists, is literal and nullable = false
>
> In this case I think we would have to actually change the column nullability for the input that has the column (from non-nullable to nullable) or to wrap another projection that accomplishes the same.

Yup, agreed! If one input contains a non-nullable column and another input is missing that column, then the final schema will have the field's nullability set to false when it really should be true. I've updated the logic there to handle the case where the number of occurrences of the field across all inputs is less than the number of inputs (i.e. the field is missing from one or more inputs). In this case, the field must be treated as nullable.

Unfortunately, I don't think that fully solves the problem. Upon further investigation, it seems the problem lies with information loss between the logical and physical schemas. Specifically, when constructing the physical expression for a literal, the nullability is not determined by the already known schema. It is based only on whether or not the literal value is null.

This can be observed by updating the implementation of `PhysicalExpr::is_nullable` to return `Ok(true)`. After that change, the test suite passes.
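For illustration, here is a minimal sketch of the nullability rule described above. It is not the code from the PR: `Field` and `merge_by_name` are hypothetical names, the types are simplified standard-library stand-ins rather than Arrow or DataFusion types, and each input is assumed to have unique field names.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified stand-in for a schema field:
/// just a column name and a nullability flag.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    nullable: bool,
}

/// Merge the schemas of several inputs by column name (hypothetical helper).
///
/// A merged field is nullable if any input declares it nullable, or if it
/// is missing from at least one input (because the missing column would be
/// filled with nulls for that input).
fn merge_by_name(inputs: &[Vec<Field>]) -> Vec<Field> {
    // name -> (number of inputs containing the field, any occurrence nullable)
    let mut seen: BTreeMap<&str, (usize, bool)> = BTreeMap::new();
    for schema in inputs {
        for field in schema {
            let entry = seen.entry(field.name.as_str()).or_insert((0, false));
            entry.0 += 1;
            entry.1 |= field.nullable;
        }
    }

    // BTreeMap iterates in key order, so the result is sorted by name.
    seen.into_iter()
        .map(|(name, (count, any_nullable))| Field {
            name: name.to_string(),
            // Fewer occurrences than inputs means some input lacks the column.
            nullable: any_nullable || count < inputs.len(),
        })
        .collect()
}
```

With the two inputs from the quoted example, where only one input has a non-nullable `zz` column, `merge_by_name` marks `zz` as nullable, while a column that is non-nullable in every input stays non-nullable. This only models the schema-level rule. It does not address the literal case above, where the physical expression derives nullability from the literal value rather than from the schema.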
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org