westonpace opened a new pull request, #13172:
URL: https://github.com/apache/arrow/pull/13172
### This PR is in a draft state because it depends on #12289 which is not
yet merged. I would not recommend reviewing it too closely yet.
This PR is the second PR in a series to merge in #12326. The first was
#12872. The new functionality can be found in
7daafeb02906b59c153127463a0c6bed4c888026 and is currently named `RowTable` and
`RowTableMerge`. In addition this PR moves around, and renames, a lot of
existing work revolving around `KeyRowArray`.
The primary goal of the refactor was to improve the readability and clarity
of the code base. I did not make any functional changes to the code and if any
functional changes are suggested which modify existing code I will happily
discuss them here but defer the changes themselves to follow-up PRs. I would
very much appreciate any feedback on naming, making sure we have sufficient
test coverage, and overall layout of the code.
It's a very large change but most of it is moving things around. What I'd
like input on most:
* KeyRowArray -> RowTableImpl, RowArray -> RowTable: This is the most
meaningful name change. The old name made sense because this data is currently
represented physically as an array of rows. However, the data is conceptually
tabular. We are storing rows & columns. In particular, I found it confusing
that KeyColumnArray was a 1D data structure while KeyRowArray was a 2D table
structure.
* Overall structure: I created a new folder `arrow/compute/row` and put
all row-based utilities in here. Most of the files are now marked as
`_internal` and the content in these files is not used outside of
`arrow/compute/row`. The grouper had previously been alongside the kernel code
and it didn't really belong there. It also relies very heavily on the internal
structure of the row encoding.
* Row structure: I documented the file `arrow/compute/row/row_internal.h`
and would appreciate review of the content here.
* Semi-external API: The "external" API now consists of
`arrow/compute/row/grouper.h`, which is more or less unchanged, and
`arrow/compute/encode.h` which is mostly new code from
7daafeb02906b59c153127463a0c6bed4c888026. We have quite a few tests exercising
`Grouper` and I was unable to easily extract them from the aggregate kernel
tests so I left the tests alone. I added tests to the new utilities in
`encode.h`. These utilities are not yet properly external because they depend
on `RowTableImpl` and I would need to migrate them to the PIMPL pattern to
remove this dependency. Since that would be a functional change I think it
would be best to wait.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]