viirya opened a new issue, #6617: URL: https://github.com/apache/arrow-rs/issues/6617
**Describe the bug**

In Comet, we recently found an interesting bug: in some cases, `Dataset.show` and `Dataset.collectAsList` return different results. We investigated and found that it is caused by the implementation of `take_bytes`.

In these cases, Comet reads a dictionary array of strings and unpacks it into a string array. In a query that uses the `TopK` operator, the operator stores its input arrays in an internal store and emits output only after all inputs are consumed. In Comet, the output arrays from the scan reuse the same buffers across batches, so for operators that cache input arrays, Comet makes a deep copy of those arrays.

However, when the dictionary array is unpacked into a string array by calling `take_bytes`, if the indices array has no nulls, the `take_bytes` kernel simply takes a full slice of the indices' null buffer (i.e., reuses it) as the null buffer of the output array. So in the next batch, once that null buffer is updated (because Comet reuses the underlying buffer), the array stored in the `TopK` operator changes as well. This makes the query result nondeterministic.

Given the semantics of the `take` kernel, its output array should not reuse the buffers of its input arrays. The current behavior looks incorrect.
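The aliasing hazard described above can be sketched with a toy model. This is not arrow-rs code and does not call `take_bytes`: `ToyArray`, `SharedValidity`, `take_sharing`, and `take_copying` are hypothetical names invented for illustration, and `Rc<RefCell<...>>` stands in for a buffer whose memory the producer overwrites between batches. It only shows why an output that shares a validity buffer with its input changes after the fact, while an output that owns a copy does not.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical stand-in for a validity (null) buffer that the producer may overwrite.
type SharedValidity = Rc<RefCell<Vec<bool>>>;

/// Hypothetical stand-in for an output string array: values plus a validity buffer.
struct ToyArray {
    values: Vec<String>,
    validity: SharedValidity,
}

impl ToyArray {
    /// Copy out the current contents of the validity buffer.
    fn validity_snapshot(&self) -> Vec<bool> {
        self.validity.borrow().clone()
    }
}

/// Models the reported behavior: the output reuses the indices' validity buffer.
fn take_sharing(dict: &[&str], indices: &[usize], validity: &SharedValidity) -> ToyArray {
    ToyArray {
        values: indices.iter().map(|&i| dict[i].to_string()).collect(),
        validity: Rc::clone(validity), // same allocation as the input
    }
}

/// Models the expected behavior: the output owns a fresh copy of the validity buffer.
fn take_copying(dict: &[&str], indices: &[usize], validity: &SharedValidity) -> ToyArray {
    ToyArray {
        values: indices.iter().map(|&i| dict[i].to_string()).collect(),
        validity: Rc::new(RefCell::new(validity.borrow().clone())),
    }
}

/// Takes from a dictionary twice, then simulates the scan reusing its buffer.
fn demo() -> (ToyArray, ToyArray) {
    let dict = ["a", "b", "c"];
    let indices = [2, 0, 1];
    // First batch: no nulls in the indices.
    let scan_validity: SharedValidity = Rc::new(RefCell::new(vec![true, true, true]));

    let shared = take_sharing(&dict, &indices, &scan_validity);
    let copied = take_copying(&dict, &indices, &scan_validity);

    // Next batch: the producer overwrites the reused buffer in place.
    scan_validity.borrow_mut()[1] = false;

    (shared, copied)
}

fn main() {
    let (shared, copied) = demo();
    println!("shared: {:?} {:?}", shared.values, shared.validity_snapshot());
    println!("copied: {:?} {:?}", copied.values, copied.validity_snapshot());
}
```

In this model, the array built by `take_sharing` reports a null at position 1 after the second batch arrives, even though it was built from an all-valid batch; the one built by `take_copying` is unaffected. A cached array in an operator like `TopK` would be in the first situation if the kernel reuses the input's null buffer.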
