westonpace commented on PR #13028:
URL: https://github.com/apache/arrow/pull/13028#issuecomment-1116491133

   `arrow::compute::internal::KeyEncoder` converts an array of values into an 
array of bytes, such that each value is represented by one or more contiguous 
bytes.  For example, a standard Arrow boolean array is represented by two 
non-contiguous "bit buffers" of length/8 bytes.  KeyEncoder represents each 
boolean value with two contiguous bytes, one for validity and one for the value.
   
   `arrow::compute::internal::RowEncoder` combines the representation from 
multiple `arrow::compute::internal::KeyEncoder` instances into a a 
`std::string` for the row.  This `std::string` is just a small byte buffer and 
shouldn't be treated as a "string" in any way.  The bytes are the bytes from 
each key encoder for that row.
   
   So if you have three key columns then the string will be the bytes for the 
first column followed by the bytes for the second column followed by the bytes 
for the third column.  This string can indeed be used as the key for a hash map.
   
   This approach works ok, but is not the most performant.  A newer version is 
being integrated which uses `arrow::compute::Hashing64::HashMultiColumn` to 
calculate an array of 8 bytes hashes.  There is no intermediate string that is 
created.
   
   All of this should not be confused with `arrow::compute::KeyEncoder` which 
is a different class entirely that is worried about converting from a columnar 
format to a row-based format.  This is important in hash-join because the 
output batches are built from random-access into the hash table.  It should not 
be important for as-of join because the output batches are still built from 
sequential (though possibly skipping) access to the input tables.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to