michalursa opened a new pull request #10858:
URL: https://github.com/apache/arrow/pull/10858


   This change refactors the core hash table implementation used in the 
grouper. The goal is to separate read-only operations on the hash table 
from operations that modify it.
   
   Originally, the only operation provided for hash table access was the map 
operation, which would return the ids of matching keys found in the hash 
table, or automatically insert keys that were not found, assigning them new 
ids.
   
   The change splits the map operation into a pipeline consisting of three 
stages:
   - early filter
   - find
   - map_new_keys
   Called one after another, the three stages provide the functionality of 
the original map operation. The output of each stage is used as the input to 
the next one. Each stage can process a smaller subset of rows than the 
previous one, because rows are filtered out at every stage.
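   The three-stage split can be sketched with a toy grouper. Everything below 
(the names, the single-word Bloom-style filter, the use of unordered_map) is 
illustrative only, not the actual Arrow C++ API:

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Toy sketch of the three-stage pipeline; not the real data layout.
struct ToyGrouper {
  static constexpr size_t kFilterBits = 64;
  uint64_t filter_ = 0;                        // tiny Bloom-style filter
  std::unordered_map<int64_t, uint32_t> map_;  // full table: key -> group id

  static uint64_t Hash(int64_t k) {
    return static_cast<uint64_t>(k) * 0x9E3779B97F4A7C15ULL;
  }

  // Stage 1: using hashes only, keep the rows that *may* be in the table.
  std::vector<size_t> EarlyFilter(const std::vector<int64_t>& keys) const {
    std::vector<size_t> maybe;
    for (size_t i = 0; i < keys.size(); ++i) {
      if (filter_ & (1ULL << (Hash(keys[i]) % kFilterBits))) {
        maybe.push_back(i);  // possible match (or a false positive)
      }
    }
    return maybe;
  }

  // Stage 2: read-only lookup; resolves false positives and outputs
  // group ids for the keys that are present.
  std::vector<size_t> Find(const std::vector<int64_t>& keys,
                           const std::vector<size_t>& maybe,
                           std::vector<uint32_t>* group_ids) const {
    std::vector<size_t> found;
    for (size_t i : maybe) {
      auto it = map_.find(keys[i]);
      if (it != map_.end()) {
        (*group_ids)[i] = it->second;
        found.push_back(i);
      }
    }
    return found;
  }

  // Stage 3: insert the keys not matched by the previous stages, assigning
  // fresh group ids; duplicates within the batch share one id.
  void MapNewKeys(const std::vector<int64_t>& keys,
                  const std::vector<bool>& is_new,
                  std::vector<uint32_t>* group_ids) {
    for (size_t i = 0; i < keys.size(); ++i) {
      if (!is_new[i]) continue;
      auto [it, inserted] =
          map_.emplace(keys[i], static_cast<uint32_t>(map_.size()));
      if (inserted) filter_ |= 1ULL << (Hash(keys[i]) % kFilterBits);
      (*group_ids)[i] = it->second;
    }
  }

  // The pre-existing `map` behaviour: the three stages back to back.
  std::vector<uint32_t> Map(const std::vector<int64_t>& keys) {
    std::vector<uint32_t> group_ids(keys.size(), 0);
    std::vector<bool> is_new(keys.size(), true);
    for (size_t i : Find(keys, EarlyFilter(keys), &group_ids)) {
      is_new[i] = false;
    }
    MapNewKeys(keys, is_new, &group_ids);
    return group_ids;
  }
};
```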
   
   The early filter stage can be seen as an equivalent of a Bloom filter. 
Based on hash values alone, without executing any key comparisons, it 
quickly marks the keys that definitely do not have a match in the hash 
table. False positives are possible but, as with a Bloom filter, their 
probability should be small.
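   As an illustration of the hash-only filtering idea (a simplified stand-in, 
not Arrow's actual data structure), a small Bloom-style bitmap can reject 
most absent keys without ever touching the keys themselves:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative Bloom-style early filter: a bitmap indexed by bits of the
// key's hash. Membership tests touch only hash bits, never the keys, so a
// "no" answer is definitive while a "yes" may be a false positive.
struct BloomSketch {
  std::vector<bool> bits_;
  explicit BloomSketch(size_t num_bits) : bits_(num_bits, false) {}

  static uint64_t Hash(int64_t key) {
    return static_cast<uint64_t>(key) * 0x9E3779B97F4A7C15ULL;
  }
  // Two bit positions derived from disjoint halves of the same hash.
  size_t Pos1(uint64_t h) const { return (h >> 32) % bits_.size(); }
  size_t Pos2(uint64_t h) const { return (h & 0xFFFFFFFFULL) % bits_.size(); }

  void Insert(int64_t key) {
    uint64_t h = Hash(key);
    bits_[Pos1(h)] = bits_[Pos2(h)] = true;
  }
  // false => key definitely absent; true => possible match.
  bool MayContain(int64_t key) const {
    uint64_t h = Hash(key);
    return bits_[Pos1(h)] && bits_[Pos2(h)];
  }
};
```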
   
   The next stage in the pipeline, the find method, corresponds to the full 
processing of all input rows whose keys are already present in the hash 
table. It is a read-only operation on the hash table. It completes the 
filtering started by the early filter, eliminating any remaining false 
positives. It also outputs the corresponding group ids for all keys found in 
the hash table. The caller may ignore the group ids if only the filtering is 
of interest, but outputting them adds no meaningful performance overhead, 
since they are needed anyway to execute the key comparisons.
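   A read-only lookup of this kind might look as follows over a toy 
linear-probing table (a hypothetical layout, not the real implementation): 
storing the full hash next to each key lets probing reject non-matches by a 
cheap hash comparison before paying for a key comparison:

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical slot layout for a read-only `find` over linear probing.
// The sketch assumes the table is never completely full, so every probe
// chain ends at an empty slot.
struct FindSketch {
  struct Slot {
    bool used = false;
    uint64_t hash = 0;
    int64_t key = 0;
    uint32_t group_id = 0;
  };
  std::vector<Slot> slots_;
  explicit FindSketch(size_t capacity) : slots_(capacity) {}

  static uint64_t Hash(int64_t key) {
    return static_cast<uint64_t>(key) * 0x9E3779B97F4A7C15ULL;
  }

  // Insertion shown only to populate the table for the demo.
  void Insert(int64_t key, uint32_t group_id) {
    uint64_t h = Hash(key);
    size_t i = h % slots_.size();
    while (slots_[i].used) i = (i + 1) % slots_.size();
    slots_[i] = {true, h, key, group_id};
  }

  // Read-only lookup: returns the group id if the key is present.
  // This is where early-filter false positives are eliminated.
  std::optional<uint32_t> Find(int64_t key) const {
    uint64_t h = Hash(key);
    for (size_t i = h % slots_.size(); slots_[i].used;
         i = (i + 1) % slots_.size()) {
      if (slots_[i].hash == h && slots_[i].key == key) {
        return slots_[i].group_id;
      }
    }
    return std::nullopt;  // definitely absent: a new key
  }
};
```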
   
   The final stage of the pipeline, map_new_keys, completes the pre-existing 
map functionality by processing all new keys from the last batch. The set of 
new keys is identified by the filtering result of the previous stage, the 
find operation. The last stage takes care of inserting new keys, assigning 
them new group ids, and resizing the hash table when it gets too full. The 
number of inserted keys may be smaller than the number of keys passed to 
this stage, since there may be duplicates among them.
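   The insert, resize, and deduplicate behaviour of the last stage can be 
sketched as follows (again a hypothetical structure with an illustrative 
50% load-factor threshold, not the actual code):

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Sketch of map_new_keys over a toy linear-probing table: inserts each
// distinct new key, assigns fresh sequential group ids, and doubles the
// table when it gets too full. Duplicates within the batch reuse the id
// of the first copy, so fewer keys may be inserted than were passed in.
struct MapNewKeysSketch {
  struct Slot {
    bool used = false;
    int64_t key = 0;
    uint32_t group_id = 0;
  };
  std::vector<Slot> slots_{8};
  size_t size_ = 0;

  static uint64_t Hash(int64_t key) {
    return static_cast<uint64_t>(key) * 0x9E3779B97F4A7C15ULL;
  }

  // First slot holding `key`, or the first empty slot on its probe chain.
  static size_t FindSlot(const std::vector<Slot>& slots, int64_t key) {
    size_t i = Hash(key) % slots.size();
    while (slots[i].used && slots[i].key != key) i = (i + 1) % slots.size();
    return i;
  }

  // Double the capacity and re-home every occupied slot.
  void Grow() {
    std::vector<Slot> bigger(slots_.size() * 2);
    for (const Slot& s : slots_) {
      if (s.used) bigger[FindSlot(bigger, s.key)] = s;
    }
    slots_ = std::move(bigger);
  }

  std::vector<uint32_t> MapNewKeys(const std::vector<int64_t>& new_keys) {
    std::vector<uint32_t> group_ids;
    for (int64_t key : new_keys) {
      if (2 * (size_ + 1) > slots_.size()) Grow();  // keep load factor <= 50%
      size_t i = FindSlot(slots_, key);
      if (!slots_[i].used) {
        slots_[i] = {true, key, static_cast<uint32_t>(size_++)};
      }
      group_ids.push_back(slots_[i].group_id);
    }
    return group_ids;
  }
};
```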
   
   The restructuring of the code should not only be useful for the 
implementation of the join exec node, but should also help in the future 
with the implementation of shared multi-threaded access. Only the last stage 
of the pipeline can modify the hash table, so it is the only one that 
requires thread synchronization. At the same time, that stage only processes 
the keys that were not present in the hash table when processing of the exec 
batch started, which in many cases can be expected to be a small fraction of 
all the inputs.
   
   This change is not tested yet.  
   
    

