2010YOUY01 commented on PR #18589:
URL: https://github.com/apache/datafusion/pull/18589#issuecomment-3539477713

   > > Thank you for working on it. I think memory-limited equal join is a 
solved problem through sort merge join, though there are some outstanding work 
to further improve it.
   > 
   > I am concerned about your focus solely on performance here. I don't care 
how fast an algo is if I need a machine with 512GB+ of ram to run a join. DF 
runs out of memory a lot for my coworkers for joins to the point that one of my 
coworkers wrote our own disk based external join because neither DF nor DuckDB 
could do the join without OOM.
   > 
   > While grace hash join may potentially be slower overall because of the 
writing to disk, possible re-partitioning if the initial # of partitions was 
too low, etc, it also is pretty much guaranteed to not require large amounts of 
memory no matter what the shape of the left and ride sides of the join are.
   > 
   > Benchmarks of course would be awesome to have, no disagreement there. 
However, I would add that memory usage would be just as important a factor as 
performance.
   
   Sort merge join has already supported spilling, so theoretically it should 
be able to run equal join with very little memory budget.
   But I know in reality it fails in some cases, so I think a better approach 
would be make memory-limited robust in SMJ first, probably it's easier than 
adding another solution.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to