viirya opened a new issue, #6617:
URL: https://github.com/apache/arrow-rs/issues/6617

   **Describe the bug**
   
   In Comet, we recently found an interesting bug: in some cases, 
`Dataset.show` and `Dataset.collectAsList` return different results.
   We investigated and found that it is caused by the implementation 
of `take_bytes`.
   
   In these cases, Comet reads a dictionary array of strings and unpacks 
it into a string array. In a query that uses the `TopK` operator, the 
operator stores its input arrays in an internal store and emits output only 
after all inputs are consumed. In Comet, the output arrays from the scan reuse 
the same buffers across batches, so for operators that cache input arrays, 
Comet makes a deep copy of these arrays.
   
   However, when unpacking the dictionary array into a string array by calling 
`take_bytes`, if the indices array has no nulls, the `take_bytes` kernel simply 
takes a full slice of the null buffer of the indices (i.e., reuses it) as the 
null buffer of the output array. So in the next batch, once the null buffer is 
updated (as Comet reuses the underlying buffer), the array stored in the `TopK` 
operator also changes. This makes the query result nondeterministic.
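
The aliasing described above can be illustrated with a small standalone model. This is only a sketch and does not use the arrow-rs API: `ToyArray`, `SharedNulls`, `take_sharing_nulls`, `take_copying_nulls`, and `row0_after_next_batch` are hypothetical names, and `Rc<RefCell<Vec<bool>>>` merely stands in for a validity buffer whose underlying memory the scan overwrites between batches.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical stand-in for a reusable validity (null) buffer: one flag per row.
type SharedNulls = Rc<RefCell<Vec<bool>>>;

/// Toy string array: owned values plus a validity buffer that may be shared.
struct ToyArray {
    values: Vec<String>,
    nulls: SharedNulls,
}

impl ToyArray {
    /// Returns the value at row `i`, or `None` if the row is marked null.
    fn get(&self, i: usize) -> Option<String> {
        if self.nulls.borrow()[i] {
            Some(self.values[i].clone())
        } else {
            None
        }
    }
}

/// Models the reported behavior: the output aliases the indices' validity buffer.
fn take_sharing_nulls(dict: &[&str], indices: &[usize], index_nulls: &SharedNulls) -> ToyArray {
    ToyArray {
        values: indices.iter().map(|&i| dict[i].to_string()).collect(),
        nulls: Rc::clone(index_nulls),
    }
}

/// Models the expected behavior: the output owns a fresh copy of the validity buffer.
fn take_copying_nulls(dict: &[&str], indices: &[usize], index_nulls: &SharedNulls) -> ToyArray {
    ToyArray {
        values: indices.iter().map(|&i| dict[i].to_string()).collect(),
        nulls: Rc::new(RefCell::new(index_nulls.borrow().clone())),
    }
}

/// Stores the result of a take, lets the "scan" overwrite its buffer for the
/// next batch, then reads row 0 back from the stored array.
fn row0_after_next_batch(copy: bool) -> Option<String> {
    let dict = ["a", "b"];
    let scan_nulls: SharedNulls = Rc::new(RefCell::new(vec![true, true]));
    let stored = if copy {
        take_copying_nulls(&dict, &[0, 1], &scan_nulls)
    } else {
        take_sharing_nulls(&dict, &[0, 1], &scan_nulls)
    };
    // The next batch arrives and the scan reuses the same validity buffer.
    scan_nulls.borrow_mut()[0] = false;
    stored.get(0)
}

fn main() {
    println!("shared buffer: {:?}", row0_after_next_batch(false));
    println!("copied buffer: {:?}", row0_after_next_batch(true));
}
```

In this model, the array stored with a shared buffer changes when the next batch arrives, while the one with a copied buffer keeps its original contents.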
   
   Considering the semantics of the `take` kernel, its output array should not 
reuse the buffers of its input arrays. The current behavior looks incorrect.
   
   **To Reproduce**
   
   **Expected behavior**
   
   **Additional context**


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
