Crystrix opened a new pull request #10443:
URL: https://github.com/apache/arrow/pull/10443
If subsequent chunks of a chunked array introduce new groups, the result of the
Arrow compute `hash_min_max` kernel is incorrect.
For example, take a table with two chunks, where the second chunk carries a new group key:
```
First chunk: {"argument": 1, "key": 0},
Second chunk: {"argument": 0, "key": 1}
```
the result of `hash_min_max` grouped by "key" on this data is
```
[{"min": null, "max": null}, 0],
[{"min": 0, "max": 0}, 1]
```
But it should be
```
[{"min": 1, "max": 1}, 0],
[{"min": 0, "max": 0}, 1]
```
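
For reference, the two-chunk input above could be built like this. This is a
minimal sketch (not taken from the PR) that assumes only the public
`Int64Builder`, `ChunkedArray`, and `Table` APIs; invoking the hash aggregation
itself goes through Arrow's internal group-by machinery and is omitted here.
```
// A minimal sketch (not from the PR): two single-row chunks per column, with
// the second chunk introducing the new key 1.
#include <memory>
#include <vector>

#include "arrow/api.h"

arrow::Result<std::shared_ptr<arrow::Table>> MakeExampleTable() {
  // Helper: build a one-element int64 array.
  auto one_value = [](int64_t value) -> arrow::Result<std::shared_ptr<arrow::Array>> {
    arrow::Int64Builder builder;
    ARROW_RETURN_NOT_OK(builder.Append(value));
    return builder.Finish();
  };

  // First chunk: {"argument": 1, "key": 0}; second chunk: {"argument": 0, "key": 1}.
  ARROW_ASSIGN_OR_RAISE(auto argument_chunk1, one_value(1));
  ARROW_ASSIGN_OR_RAISE(auto argument_chunk2, one_value(0));
  ARROW_ASSIGN_OR_RAISE(auto key_chunk1, one_value(0));
  ARROW_ASSIGN_OR_RAISE(auto key_chunk2, one_value(1));

  auto argument = std::make_shared<arrow::ChunkedArray>(
      arrow::ArrayVector{argument_chunk1, argument_chunk2});
  auto key = std::make_shared<arrow::ChunkedArray>(
      arrow::ArrayVector{key_chunk1, key_chunk2});

  auto schema = arrow::schema({arrow::field("argument", arrow::int64()),
                               arrow::field("key", arrow::int64())});
  return arrow::Table::Make(schema, {argument, key});
}
```
Running the grouped min/max over this table produces the null min/max for key 0
shown above; with the fix it produces {1, 1}.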
The root cause is that `has_values_` and `has_nulls_` are plain `BufferBuilder`s,
which do not carry the `size_` and `capacity_` that the typed builder needs, so
the `MakeResizeImpl` function initializes a `TypedBufferBuilder` from the
`BufferBuilder` with a `size_` and `capacity_` of 0. After the first chunk has
been processed, `MakeResizeImpl` is called while consuming the second chunk to
reserve enough space for it. Because `size_` and `capacity_` are zero, `Reserve`
overwrites the contents of the original `BufferBuilder`, which produces the
incorrect result above (illustrated in the sketch below).
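
To make the failure mode concrete, here is a hypothetical paraphrase of what
happens to the per-group bitmap when the logical length is lost between chunks.
It is not the kernel's code, only an illustration using Arrow's bit utilities
(the `arrow::BitUtil` namespace as of the 4.x series).
```
// Hypothetical illustration (not the kernel code): the bitmap bytes survive
// between chunks, but the rebuilt typed builder believes its length is 0, so
// the bit "reserved" for the new group is written on top of group 0's bit.
#include <cstdint>
#include <cstdio>

#include "arrow/util/bit_util.h"

int main() {
  uint8_t has_values[1] = {0};  // per-group "has value" bitmap (up to 8 groups)
  int64_t length = 0;           // how many group bits have been written so far

  // Chunk 1: group 0 saw the non-null value 1, so its bit is set.
  arrow::BitUtil::SetBitTo(has_values, length++, true);

  // Chunk 2: group 1 is new, but the length bookkeeping was lost (it restarts
  // at 0), so the cleared bit meant for the new group lands on group 0's bit.
  int64_t lost_length = 0;
  arrow::BitUtil::SetBitTo(has_values, lost_length, false);

  // Prints 0: group 0 now looks like it never saw a value, hence the null
  // min/max in the result above.
  std::printf("group 0 has value: %d\n",
              static_cast<int>(arrow::BitUtil::GetBit(has_values, 0)));
  return 0;
}
```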
This PR makes `has_values_` and `has_nulls_` dedicated `TypedBufferBuilder<bool>`
members, which keep their `size_` and `capacity_` across chunks. When the second
chunk is consumed, the space for `has_values_` and `has_nulls_` is then reserved
after the data written for the first chunk.
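
Below is a minimal sketch, again hypothetical rather than the kernel's actual
code, of the behavior the fix relies on: a `TypedBufferBuilder<bool>` that lives
across chunks remembers its length, so space for a new group is appended after
the bit already written for group 0.
```
// Minimal sketch of the fixed pattern: one TypedBufferBuilder<bool> kept as a
// member across chunks, so each chunk appends after the previous one's bits.
#include <cstdio>
#include <memory>

#include "arrow/buffer.h"
#include "arrow/buffer_builder.h"
#include "arrow/status.h"
#include "arrow/util/bit_util.h"

arrow::Status Demo() {
  arrow::TypedBufferBuilder<bool> has_values;  // persists across chunks

  // Chunk 1: group 0 saw the non-null value 1.
  ARROW_RETURN_NOT_OK(has_values.Append(/*num_copies=*/1, /*value=*/true));

  // Chunk 2: group 1 is new. length() is still 1, so this appends a cleared
  // bit after group 0's bit instead of overwriting it; the bit is set later
  // when group 1's value 0 is consumed.
  ARROW_RETURN_NOT_OK(has_values.Append(/*num_copies=*/1, /*value=*/false));

  std::shared_ptr<arrow::Buffer> bitmap;
  ARROW_RETURN_NOT_OK(has_values.Finish(&bitmap));
  // Prints 1: group 0 still counts as "has value", so its min/max stay 1.
  std::printf("group 0 has value: %d\n",
              static_cast<int>(arrow::BitUtil::GetBit(bitmap->data(), 0)));
  return arrow::Status::OK();
}

int main() { return Demo().ok() ? 0 : 1; }
```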