westonpace commented on PR #13028: URL: https://github.com/apache/arrow/pull/13028#issuecomment-1116491133
`arrow::compute::internal::KeyEncoder` converts an array of values into an array of bytes, such that each value is represented by one or more contiguous bytes. For example, a standard Arrow boolean array is represented by two non-contiguous "bit buffers" of length/8 bytes each. KeyEncoder instead represents each boolean value with two contiguous bytes, one for validity and one for the value.

`arrow::compute::internal::RowEncoder` combines the representations from multiple `arrow::compute::internal::KeyEncoder` instances into a `std::string` for the row. This `std::string` is just a small byte buffer and shouldn't be treated as a "string" in any way; its contents are the bytes from each key encoder for that row. So if you have three key columns, the string will be the bytes for the first column, followed by the bytes for the second column, followed by the bytes for the third column. This string can indeed be used as the key for a hash map.

This approach works, but it is not the most performant. A newer version is being integrated which uses `arrow::compute::Hashing64::HashMultiColumn` to calculate an array of 8-byte hashes directly, with no intermediate string.

All of this should not be confused with `arrow::compute::KeyEncoder`, which is an entirely different class concerned with converting from a columnar format to a row-based format. That conversion is important in hash-join because the output batches are built from random access into the hash table. It should not be important for as-of join because the output batches are still built from sequential (though possibly skipping) access to the input tables.
