Crystrix opened a new pull request #10443:
URL: https://github.com/apache/arrow/pull/10443
If subsequent chunks of a chunked array introduce new groups, the result of the
Arrow compute `hash_min_max` kernel is incorrect.
For example, take a table with two chunks, where the second chunk carries a new group key:
```
First chunk: {"argument": 1, "key": 0},
Second chunk: {"argument": 0, "key": 1}
```
the result of `hash_min_max` grouped by "key" on this data is
```
[{"min": null, "max": null}, 0],
[{"min": 0, "max": 0}, 1]
```
But it should be
```
[{"min": 1, "max": 1}, 0],
[{"min": 0, "max": 0}, 1]
```
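
For reference, the two-chunk input above could be built like this. This is a
minimal sketch (not taken from the PR) that assumes only the public
`Int64Builder`, `ChunkedArray`, and `Table` APIs; invoking the hash aggregation
itself goes through Arrow's internal group-by machinery and is omitted here.
```
// A minimal sketch (not from the PR): two single-row chunks per column, with
// the second chunk introducing the new key 1.
#include <memory>
#include <vector>

#include "arrow/api.h"

arrow::Result<std::shared_ptr<arrow::Table>> MakeExampleTable() {
  // Helper: build a one-element int64 array.
  auto one_value = [](int64_t value) -> arrow::Result<std::shared_ptr<arrow::Array>> {
    arrow::Int64Builder builder;
    ARROW_RETURN_NOT_OK(builder.Append(value));
    return builder.Finish();
  };

  // First chunk: {"argument": 1, "key": 0}; second chunk: {"argument": 0, "key": 1}.
  ARROW_ASSIGN_OR_RAISE(auto argument_chunk1, one_value(1));
  ARROW_ASSIGN_OR_RAISE(auto argument_chunk2, one_value(0));
  ARROW_ASSIGN_OR_RAISE(auto key_chunk1, one_value(0));
  ARROW_ASSIGN_OR_RAISE(auto key_chunk2, one_value(1));

  auto argument = std::make_shared<arrow::ChunkedArray>(
      arrow::ArrayVector{argument_chunk1, argument_chunk2});
  auto key = std::make_shared<arrow::ChunkedArray>(
      arrow::ArrayVector{key_chunk1, key_chunk2});

  auto schema = arrow::schema({arrow::field("argument", arrow::int64()),
                               arrow::field("key", arrow::int64())});
  return arrow::Table::Make(schema, {argument, key});
}
```
Running the grouped min/max over this table produces the null min/max for key 0
shown above; with the fix it produces {1, 1}.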
The root cause is that `has_values_` and `has_nulls_` are plain `BufferBuilder`s,
which do not carry the `size_` and `capacity_` that the typed builder needs, so
the `MakeResizeImpl` function initializes a `TypedBufferBuilder` from the
`BufferBuilder` with a `size_` and `capacity_` of 0. After the first chunk has
been processed, `MakeResizeImpl` is called while consuming the second chunk to
reserve enough space for it. Because `size_` and `capacity_` are zero, `Reserve`
overwrites the contents of the original `BufferBuilder`, which produces the
incorrect result above (illustrated in the sketch below).
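
To make the failure mode concrete, here is a hypothetical paraphrase of what
happens to the per-group bitmap when the logical length is lost between chunks.
It is not the kernel's code, only an illustration using Arrow's bit utilities
(the `arrow::BitUtil` namespace as of the 4.x series).
```
// Hypothetical illustration (not the kernel code): the bitmap bytes survive
// between chunks, but the rebuilt typed builder believes its length is 0, so
// the bit "reserved" for the new group is written on top of group 0's bit.
#include <cstdint>
#include <cstdio>

#include "arrow/util/bit_util.h"

int main() {
  uint8_t has_values[1] = {0};  // per-group "has value" bitmap (up to 8 groups)
  int64_t length = 0;           // how many group bits have been written so far

  // Chunk 1: group 0 saw the non-null value 1, so its bit is set.
  arrow::BitUtil::SetBitTo(has_values, length++, true);

  // Chunk 2: group 1 is new, but the length bookkeeping was lost (it restarts
  // at 0), so the cleared bit meant for the new group lands on group 0's bit.
  int64_t lost_length = 0;
  arrow::BitUtil::SetBitTo(has_values, lost_length, false);

  // Prints 0: group 0 now looks like it never saw a value, hence the null
  // min/max in the result above.
  std::printf("group 0 has value: %d\n",
              static_cast<int>(arrow::BitUtil::GetBit(has_values, 0)));
  return 0;
}
```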
This PR makes `has_values_` and `has_nulls_` dedicated `TypedBufferBuilder<bool>`
members, which keep their `size_` and `capacity_` across chunks. When the second
chunk is consumed, the space for `has_values_` and `has_nulls_` is then reserved
after the data written for the first chunk.
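
Below is a minimal sketch, again hypothetical rather than the kernel's actual
code, of the behavior the fix relies on: a `TypedBufferBuilder<bool>` that lives
across chunks remembers its length, so space for a new group is appended after
the bit already written for group 0.
```
// Minimal sketch of the fixed pattern: one TypedBufferBuilder<bool> kept as a
// member across chunks, so each chunk appends after the previous one's bits.
#include <cstdio>
#include <memory>

#include "arrow/buffer.h"
#include "arrow/buffer_builder.h"
#include "arrow/status.h"
#include "arrow/util/bit_util.h"

arrow::Status Demo() {
  arrow::TypedBufferBuilder<bool> has_values;  // persists across chunks

  // Chunk 1: group 0 saw the non-null value 1.
  ARROW_RETURN_NOT_OK(has_values.Append(/*num_copies=*/1, /*value=*/true));

  // Chunk 2: group 1 is new. length() is still 1, so this appends a cleared
  // bit after group 0's bit instead of overwriting it; the bit is set later
  // when group 1's value 0 is consumed.
  ARROW_RETURN_NOT_OK(has_values.Append(/*num_copies=*/1, /*value=*/false));

  std::shared_ptr<arrow::Buffer> bitmap;
  ARROW_RETURN_NOT_OK(has_values.Finish(&bitmap));
  // Prints 1: group 0 still counts as "has value", so its min/max stay 1.
  std::printf("group 0 has value: %d\n",
              static_cast<int>(arrow::BitUtil::GetBit(bitmap->data(), 0)));
  return arrow::Status::OK();
}

int main() { return Demo().ok() ? 0 : 1; }
```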