rkrishn7 commented on PR #15242:
URL: https://github.com/apache/datafusion/pull/15242#issuecomment-2746563234

   @Omega359 Thanks for your thoughts!
   
   And yes, I agree that the problem isn't with `normalize::convert_batches`. 
It is just being surfaced there.
   
   > I have a thought as to why it's happening. When you fill a column with 
null when that input is missing the column you are not preserving the other 
aspects of the column, specifically nullable. Thus, different batches (for the 
different inputs) can have differing nullable values.
   > 
   > Input 1: 'zz' column missing, is added and set to null (thus nullable = 
true)
   > Input 2: 'zz' column exists, is literal and nullable = false
   > 
   > In this case I think we would have to actually change the column 
nullability for the input that has the column (from non-nullable to nullable) 
or to wrap another projection that accomplishes the same.
   
   Yup, agreed! If one input contains a non-nullable column and the other is 
missing that column entirely, then the final schema will have the field's 
nullability set to false when it really should be true.
   
   I've updated the logic there to handle the case where the number of 
occurrences of the field across all inputs is less than the number of inputs 
(i.e. the field is missing from one or more inputs). In that case, the field 
must be treated as nullable.
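
   As a rough illustration of that rule (not the actual code in this PR), 
here is a minimal stdlib-only sketch. `Field` is a simplified stand-in for an 
Arrow field, `merged_nullability` is a hypothetical helper name, and the sketch 
assumes each field name appears at most once per input.

```rust
use std::collections::HashMap;

/// Simplified stand-in for an Arrow field: only a name and a nullable flag.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    nullable: bool,
}

/// Hypothetical helper: nullability of each field in the merged schema.
/// Assumes a field name appears at most once within a single input.
fn merged_nullability(inputs: &[Vec<Field>]) -> HashMap<String, bool> {
    // name -> (number of inputs containing it, nullable in any input)
    let mut seen: HashMap<String, (usize, bool)> = HashMap::new();
    for input in inputs {
        for field in input {
            let entry = seen.entry(field.name.clone()).or_insert((0, false));
            entry.0 += 1;
            entry.1 |= field.nullable;
        }
    }
    seen.into_iter()
        .map(|(name, (count, any_nullable))| {
            // A field missing from any input is filled with nulls there,
            // so it has to be nullable in the merged schema.
            let missing_somewhere = count < inputs.len();
            (name, any_nullable || missing_somewhere)
        })
        .collect()
}
```

   With two inputs where only the second one has a non-nullable `zz`, this 
sketch reports `zz` as nullable, while a column present and non-nullable in 
both inputs stays non-nullable.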
   
   Unfortunately, I don't think that fully solves the problem. Upon further 
investigation, the problem seems to lie in information loss between the 
logical and physical schemas. Specifically, when constructing the physical 
expression for a literal, the nullability is not taken from the already 
known schema. It is based solely on whether or not the literal value is null.
   
   This can be observed by updating the implementation of 
`PhysicalExpr::is_nullable` to always return `Ok(true)`. With that change, the 
test suite passes.
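
   To make the mismatch concrete, here is a small sketch under stated 
assumptions. `Literal`, `nullable_from_value`, and `nullability_matches` are 
hypothetical names for illustration only and are not DataFusion APIs; the 
point is just that a non-null literal reports itself as non-nullable even when 
the logical schema has already declared the column nullable.

```rust
/// Simplified stand-in for a literal scalar value; `None` models SQL NULL.
#[derive(Debug, Clone)]
struct Literal {
    value: Option<i64>,
}

impl Literal {
    /// Nullability derived from the value alone, ignoring any known schema.
    fn nullable_from_value(&self) -> bool {
        self.value.is_none()
    }
}

/// Hypothetical check: does the nullability reported by the literal agree
/// with what the logical schema declared for the same column?
fn nullability_matches(logical_nullable: bool, literal: &Literal) -> bool {
    logical_nullable == literal.nullable_from_value()
}
```

   For a column the logical schema marks nullable, a non-null literal fails 
this check, which is the kind of disagreement described above.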
   
   


