alamb commented on PR #21882: URL: https://github.com/apache/datafusion/pull/21882#issuecomment-4412430851
> @alamb I opened this draft PR to get early feedback on the architecture. > > 1. The first point is around the sync read path. I introduced `open_sync_reader` because `SortMergeJoin` currently has synchronous, blocking code paths that directly open files using paths and `BufReader`, instead of going through the spill abstractions. Converting this to fully async would significantly increase the scope of this PR. > > * Does it make sense to keep this escape hatch for now and handle making these operators async in a follow-up PR? Kind of, though it seems like accumulating technical debt as we'll have APIs that will not be needed once we complete the work for SortMergeJoin What do you think about making a first PR to migrate SortMergeJoin to use the spill abstraction? > 2. The second point is regarding test failures. I have not modified the original 64B limit in the tests because I wanted guidance here. Currently, the `repartition` test in `mod.rs` is failing, and it seems related to spilling not being triggered correctly, the new `SpillWriteAdapter` adds slight allocation overhead which makes the original 64-byte memory limit too tight for the merge heap to initialize (~296 bytes needed), bumping up the memory limit causes the test to not spill anymore, I believe increasing the test data size might solve the issue, but am not sure. Makes sense to me -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
