[GitHub] [arrow-datafusion] alamb commented on pull request #1782: Introduce `Row` format backed by raw bytes

GitBox Wed, 09 Feb 2022 14:27:39 -0800


alamb commented on pull request #1782:
URL: 
https://github.com/apache/arrow-datafusion/pull/1782#issuecomment-1034262113



   > [Use directly] The current row implementation is mostly targeted at 
payload use cases, that we do not update or check by index after the first 
write, and only decompose to record batch at last. This is the case for sort 
payload, hash aggregate composed grouping key (we can directly compare raw 
bytes for equality), hash join key, and join payload.
   
   I am not sure about sorting payloads. It seems to me like copying around the 
sort payload will potentially be quite ineffecient
   
   Consider a table like
   ```sql
   CREATE TABLE sales (
     value float,
     first_name varchar(2000),
     last_name varchar(2000),
     address varchar(2000)
   )
   ```
    
   And a query like
   
   ```sql
   SELECT * from sales order by value;
   ```
   
   In this case only value needs to be compared, and the payload may be 
substantial
   
   I thought the current state of the art was to do something like 
   1. `RecordBatch`es --> `Rows` (only for sort key)
   2. Sort the `Rows` using `memcmp`
   3. Use `take` kernel to form the output rows (copy payload columns)
   
   This copies the payload columns only once
   
   If you instead use the Row to hold the payload, you end up
   
   1. `RecordBatch`es --> `Row payload`
   2. Form something to compare using
   2. Sort
   3. Now covert back from Row to `RecordBatch`
   
   Which results in copying the payloads *twice* - and for large tables this is 
a substantial overhead. 
   
   
   However, I agree a format like this be helpful for storing hash aggregate 
composed grouping keys, join keys (and maybe intermediate aggregates)
   
   I'll give this PR a good look tomorrow morning
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] alamb commented on pull request #1782: Introduce `Row` format backed by raw bytes

Reply via email to