michalursa opened a new pull request #9768: URL: https://github.com/apache/arrow/pull/9768
This is a draft of the code that maps an arbitrary set of input columns, treated as the key in a grouping operation, to a vector of integer group identifiers (identical combinations of key column values get the same id).

I will continue working on it and update it with:
- integration with the initial hash group-by implementation in the Arrow project, once that is finished and merged into master
- unit tests
- documentation

At this point group ids, row ids, offsets, and hash values are 32-bit. Overflow checks are missing in the current version and still need to be added.

The entry point for id mapping is the GroupBy class. It uses three main modules: storage, defined in the groupby_storage* files; hash, defined in the groupby_hash* files; and the hash table, defined in the groupby_map* files.

Key values stored with the hash table are row-oriented. The storage module defines functions that convert from column-oriented storage to row-oriented storage and back. It also implements key comparison and appending keys to the incremental store.

I plan to add a design doc in the form of a readme file later on.

The individual modules and functions here have been tested with unit tests and pass them, but the unit tests are not included in this change yet.
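
For readers who want a concrete picture of the id mapping described above, here is a minimal Python sketch. It is not the code in this pull request, which is C++: a plain Python dict stands in for the custom hash and hash table modules, and the names `columns_to_rows`, `rows_to_columns`, and `GroupIdMapper` are hypothetical. It also assumes int32 key columns, unsigned 32-bit group ids, and ids assigned in order of first appearance; none of these details are stated in the description.

```python
import struct
from typing import Dict, List, Sequence

# Assumption: group ids are unsigned 32-bit. The description only says "32-bit".
UINT32_MAX = 2**32 - 1


def columns_to_rows(columns: Sequence[Sequence[int]]) -> List[bytes]:
    """Pack column-oriented int32 key columns into one byte string per row."""
    if not columns:
        return []
    num_rows = len(columns[0])
    if any(len(col) != num_rows for col in columns):
        raise ValueError("all key columns must have the same length")
    fmt = "<" + "i" * len(columns)  # little-endian int32 per key column
    return [struct.pack(fmt, *(col[r] for col in columns)) for r in range(num_rows)]


def rows_to_columns(rows: Sequence[bytes], num_columns: int) -> List[List[int]]:
    """Unpack row-oriented keys back into column-oriented lists."""
    fmt = "<" + "i" * num_columns
    columns: List[List[int]] = [[] for _ in range(num_columns)]
    for row in rows:
        for col, value in zip(columns, struct.unpack(fmt, row)):
            col.append(value)
    return columns


class GroupIdMapper:
    """Hypothetical stand-in for the id-mapping entry point."""

    def __init__(self) -> None:
        self._ids: Dict[bytes, int] = {}  # row-oriented key -> group id
        self._keys: List[bytes] = []      # append-only key store; index == group id

    def map_batch(self, columns: Sequence[Sequence[int]]) -> List[int]:
        """Return one group id per input row, adding unseen keys to the store."""
        out: List[int] = []
        for row in columns_to_rows(columns):
            gid = self._ids.get(row)
            if gid is None:
                gid = len(self._keys)
                # The kind of overflow check the description says is still missing.
                if gid > UINT32_MAX:
                    raise OverflowError("group id does not fit in 32 bits")
                self._ids[row] = gid
                self._keys.append(row)
            out.append(gid)
        return out

    def unique_keys(self, num_columns: int) -> List[List[int]]:
        """Return the stored keys in column-oriented form, ordered by group id."""
        return rows_to_columns(self._keys, num_columns)


mapper = GroupIdMapper()
ids = mapper.map_batch([[1, 2, 1, 3], [10, 20, 10, 20]])
# ids == [0, 1, 0, 2]: rows (1, 10) and (1, 10) share group id 0
```

Calling `map_batch` again on a later batch reuses the ids of keys already in the store, which is the role the incremental store plays in the description. Packing each row into a single byte string is what makes whole-key comparison and hashing a single operation in this sketch.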
