geoffreyclaude commented on PR #18832:
URL: https://github.com/apache/datafusion/pull/18832#issuecomment-3628557627

   > > Another general comment, on the implementation this time: hashing seems 
overkill and probably overly expensive for small simple type lists.
   > > @adriangb have you considered sorting the `InList` and doing a binary 
search?
   > > For small lists (threshold to be refined...) this could be orders of 
magnitude faster than the overhead of hashing. To validate with the new fixed 
benchmarks of course!
   > 
   > I think that's a really neat idea! I did look into using binary search at 
some point (not this PR) but then realized that there was already the hashing 
in place and didn't pursue it further. So while I agree it could be interesting 
to have two approaches, I think we should focus on improving the current 
approach right now especially since we probably degraded performance since last 
release. Maybe let's file a ticket to track benchmarking that approach?
   
   This was too much fun to skip the opportunity! I've opened a draft PR on 
your branch that shows pretty cool results: 
https://github.com/pydantic/datafusion/pull/46
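
For readers skimming the thread, here is a minimal sketch of the sort-then-binary-search idea discussed above. It is not the code from the draft PR: `SortedInList` and `build_hashed` are hypothetical names, and `i64` stands in for whatever simple primitive type the list holds. It only contrasts the two lookup strategies; any speedup would still need to be confirmed with the benchmarks.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a small `IN (...)` list of primitive values,
/// kept sorted so that membership can be checked by binary search.
pub struct SortedInList {
    values: Vec<i64>,
}

impl SortedInList {
    /// Sort and deduplicate once, when the list is built.
    pub fn new(mut values: Vec<i64>) -> Self {
        values.sort_unstable();
        values.dedup();
        Self { values }
    }

    /// O(log n) lookup with no hashing; for small n this is a handful of
    /// comparisons on a contiguous slice.
    pub fn contains(&self, needle: i64) -> bool {
        self.values.binary_search(&needle).is_ok()
    }
}

/// Hash-based baseline for comparison: each lookup pays for hashing the probe.
pub fn build_hashed(values: &[i64]) -> HashSet<i64> {
    values.iter().copied().collect()
}
```

Both structures answer the same membership question, so a threshold on list length could in principle choose between them. Where that threshold sits is exactly what the benchmarks would have to show.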


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

