2010YOUY01 commented on issue #16065: URL: https://github.com/apache/datafusion/issues/16065#issuecomment-2888233712
Welcome aboard! We're excited to collaborate with you on this GSoC project 😄

Regarding the plan, I can see the following sub-tasks:

1. Stabilize external sort and aggregate.
2. Implement a memory-limited nested loop join (NLJ). This serves as a safe fallback in case external sort-merge join (SMJ) or future external hash join (HJ) implementations fail in certain scenarios. It can also be used for differential testing against other join executor implementations.
3. Optimize the spill format, likely building on top of Arrow's IPC stream reader/writer. (And also improve UX/performance along the way.)

I plan to open separate issues for each sub-task to better describe the problems and outline the approaches.

Are there any other tasks worth exploring? I'm not very familiar with Arrow IPC internals. Are there any stream reader/writer-related tasks we could also consider? @alamb