gabotechs commented on PR #19761:
URL: https://github.com/apache/datafusion/pull/19761#issuecomment-3744403441
> I am wondering what the speedup would be if we just load a single batch
per partition instead of based on size - perhaps the speedup is similar with
reduced overhead?
Tested it locally and it does not seem to have a significant impact. Even
doing an unbounded buffering has no significant impact. What I get from that is
that the main driver for speed is the fact that something forces the probe side
`RecordBatchStream`s to make progress whether that implies buffering actual
record batches or not.
> While I am excited about the speedup, I wonder if this is not mostly
"hiding" some "issues" with our current join implementation: hashing and
concatenation is currently single-threaded for CollectLeft, both could be
parallelized. Also we could lower the overhead of the parallel hash join and
reduce the threshold to make the total query run more in parallel and avoid
time spent waiting on one not well parallelizable.
This can indeed reduce the impact of any issue affecting build side creation
speed. I would not say "hiding" as the issues should still be visible, just
that the overall query latency is no longer the best metric to discover those.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]