viirya opened a new issue, #1030:
URL: https://github.com/apache/datafusion-comet/issues/1030

   ### Describe the bug
   
   We recently found a bug where, in some cases, `Dataset.show` and 
`Dataset.collectAsList` return different results.
   We investigated and found that it is caused by the implementation of 
`take_bytes`.
   
   In these cases, Comet reads a dictionary array of strings and unpacks the 
dictionary array into a string array. In a query that uses the `TopK` operator, 
the operator stores its input arrays in an internal store and emits output only 
after all inputs are consumed. In Comet, the output arrays from the scan reuse 
the same buffers across batches. For operators that cache input arrays, Comet 
makes a deep copy of these arrays.
   
   However, when unpacking a dictionary array into a string array by calling 
`take_bytes`, if the indices array has no nulls, the `take_bytes` kernel simply 
takes a full slice of the null buffer of the indices (i.e., reuses it) as the 
null buffer of the output array. So in the next batch, once the null buffer is 
updated (as Comet reuses the underlying buffer), the array stored in the `TopK` 
operator also changes. This makes the query result nondeterministic.
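   
   The aliasing described above can be sketched with the standard library 
alone. This is a toy model, not arrow-rs code: `SharedBuffer`, `TakenArray`, 
`take_sharing`, `take_copying`, and `cached_after_next_batch` are hypothetical 
names, and an `Rc<RefCell<Vec<bool>>>` stands in for a validity buffer that the 
scan overwrites in place between batches.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical stand-in for a validity (null) buffer reused across batches.
type SharedBuffer = Rc<RefCell<Vec<bool>>>;

/// Toy output of a take-like kernel: owned values plus a validity buffer.
struct TakenArray {
    values: Vec<String>,
    validity: SharedBuffer,
}

/// Models the problematic path: the output shares the indices' validity buffer.
fn take_sharing(dict: &[&str], indices: &[usize], validity: &SharedBuffer) -> TakenArray {
    TakenArray {
        values: indices.iter().map(|&i| dict[i].to_string()).collect(),
        validity: Rc::clone(validity), // same allocation as the input
    }
}

/// Models the expected path: the output owns a copy of the validity buffer.
fn take_copying(dict: &[&str], indices: &[usize], validity: &SharedBuffer) -> TakenArray {
    TakenArray {
        values: indices.iter().map(|&i| dict[i].to_string()).collect(),
        validity: Rc::new(RefCell::new(validity.borrow().clone())), // fresh allocation
    }
}

/// Reads the array as a consumer would: `None` wherever the slot is invalid.
fn materialize(arr: &TakenArray) -> Vec<Option<String>> {
    let validity = arr.validity.borrow();
    let out: Vec<Option<String>> = arr
        .values
        .iter()
        .zip(validity.iter())
        .map(|(v, &ok)| if ok { Some(v.clone()) } else { None })
        .collect();
    out
}

/// Caches the first batch, lets the "scan" overwrite the reused buffer for the
/// next batch, then returns what the cached batch looks like afterwards.
fn cached_after_next_batch(share: bool) -> Vec<Option<String>> {
    let dict = ["a", "b"];
    let validity: SharedBuffer = Rc::new(RefCell::new(vec![true, true]));
    let cached = if share {
        take_sharing(&dict, &[0, 1], &validity)
    } else {
        take_copying(&dict, &[0, 1], &validity)
    };
    // Next batch arrives: the reused buffer is updated in place.
    validity.borrow_mut()[1] = false;
    materialize(&cached)
}

fn main() {
    println!("shared: {:?}", cached_after_next_batch(true));
    println!("copied: {:?}", cached_after_next_batch(false));
}
```

   In this model, the cached array built by `take_sharing` changes after the 
buffer is overwritten, while the one built by `take_copying` keeps its original 
contents.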
   
   Given the semantics of the `take` kernel, its output array should not reuse 
the buffers of its input arrays. The current behavior looks incorrect.
   
   We are going to fix it in arrow-rs.
   
   
   ### Steps to reproduce
   
   _No response_
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


