viirya opened a new issue, #1030: URL: https://github.com/apache/datafusion-comet/issues/1030
### Describe the bug

We recently found an interesting bug: in some cases, `Dataset.show` and `Dataset.collectAsList` return different results. We investigated and found that it is caused by the implementation of `take_bytes`.

In these cases, Comet reads a dictionary array of strings and unpacks it into a string array. In a query that uses the `TopK` operator, the operator stores its input arrays in an internal store and emits output only after all inputs are consumed.

In Comet, the output arrays from the scan reuse the same buffers across batches. For operators that cache input arrays, Comet makes a deep copy of these arrays. However, when a dictionary array is unpacked into a string array by calling `take_bytes`, and the indices array has no nulls, the `take_bytes` kernel simply takes a full slice of the indices' null buffer (i.e., it reuses the buffer) as the null buffer of the output array. So in the next batch, once the null buffer is updated (because Comet reuses the underlying buffer), the array stored in the `TopK` operator changes as well. This makes the query result nondeterministic.

Given the semantics of the `take` kernel, its output array should not reuse the buffers of its input array. The current behavior looks incorrect. We are going to fix it in arrow-rs.

### Steps to reproduce

_No response_

### Expected behavior

_No response_

### Additional context

_No response_
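The buffer aliasing described in this issue can be illustrated with a small, self-contained Rust model. This is only a sketch and does not use the arrow-rs API: `SharedBuffer`, `take_sharing`, `take_copying`, and `demo` are hypothetical names, and an `Rc<RefCell<Vec<bool>>>` stands in for a null (validity) buffer that the scan overwrites in place between batches.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical stand-in for a reusable validity buffer (true = valid).
/// Not an arrow-rs type; it only models shared, mutable storage.
type SharedBuffer = Rc<RefCell<Vec<bool>>>;

/// Models a `take` that hands back the indices' null buffer itself
/// (the reuse described in the issue): the output aliases the input.
fn take_sharing(indices_validity: &SharedBuffer) -> SharedBuffer {
    Rc::clone(indices_validity)
}

/// Models a `take` that allocates a fresh buffer for its output,
/// so later writes to the input cannot affect the result.
fn take_copying(indices_validity: &SharedBuffer) -> SharedBuffer {
    Rc::new(RefCell::new(indices_validity.borrow().clone()))
}

/// Returns the cached validity as seen after the next batch arrives:
/// first for the sharing variant, then for the copying variant.
fn demo() -> (Vec<bool>, Vec<bool>) {
    // Batch 1: the indices have no nulls.
    let scan_buffer: SharedBuffer = Rc::new(RefCell::new(vec![true, true, true]));

    // An operator such as TopK caches the output of `take`.
    let cached_shared = take_sharing(&scan_buffer);
    let cached_copied = take_copying(&scan_buffer);

    // Batch 2: the scan overwrites the same underlying buffer in place.
    scan_buffer.borrow_mut()[1] = false;

    let shared = cached_shared.borrow().clone();
    let copied = cached_copied.borrow().clone();
    (shared, copied)
}

fn main() {
    let (shared, copied) = demo();
    println!("sharing take: {:?}", shared); // [true, false, true]
    println!("copying take: {:?}", copied); // [true, true, true]
}
```

In this model, the cached result of the sharing variant changes after the second batch is written, while the copying variant keeps the values it had when it was cached. That matches the symptom reported above, where the array stored in `TopK` changes once the next batch arrives.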
