Re: [PR] POC: Vectorized hashtable for aggregation [datafusion]

via GitHub Sun, 27 Oct 2024 22:46:38 -0700


Rachelint commented on code in PR #12996:
URL: https://github.com/apache/datafusion/pull/12996#discussion_r1818406542



##########
datafusion/physical-plan/src/aggregates/group_values/group_column.rs:
##########
@@ -287,6 +469,63 @@ where
         };
     }
 
+    fn vectorized_equal_to(

Review Comment:
   🤔 Yes, I agree that the row-by-row checking is indeed not efficient enough, 
and switching to an implementation similar to the one in `hash_join` may be 
really worth trying.
   
   Maybe it is better to try it in a follow-on PR? The following points are 
still not clear to me, and I want to experiment with them:
   - Do we need a reusable buffer to hold the taken values?
   - It seems that `cmp` for some arrays like `StringArray` and 
`StringViewArray` is expensive? 
      Is it better to just check row by row so that we can skip some 
unnecessary checks (if a `row` is not equal in `col a`, we don't actually need 
to check it again in `col b`)?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


