adriangb commented on issue #19724:
URL: https://github.com/apache/datafusion/issues/19724#issuecomment-3877524429
> I have some issues:
>
> a) If files are overlapped aggressively, how to process?
> b) The decrease in parallelism is something that needs careful
> consideration, since query requests are quite diverse, not only topk.

I think those are great questions.
Part of the issue is that even for slightly overlapping files there's
benefit, so it's not black and white.
I'm not sure what heuristic to use or how to approach it.
One thought was that we'll need a combination of the `FileSource` knowing
how to re-arrange files to satisfy a sort order + requested partitions *and* an
optimizer rule with a global view of the query.
For example:
1. Start with unsorted groups and N partitions.
2. Optimizer knows there is an upstream Sort so it ideally wants to set up a
   ProgressiveEval type plan.
3. Optimizer pushes down 2 partitions + required sort order.
   a. If the scan can produce 2 non-overlapping groups it does so and sets its
      output ordering to `Exact`.
      The optimizer picks up on this and sets up a ProgressiveEval type plan
      (or even just a plan with the Sort with lower parallelism, since
      parallelism would not be helpful).
   b. If the scan cannot produce 2 non-overlapping groups it produces
      best-effort sorted groups and sets its ordering to `Inexact`.
      The optimizer may decide to keep this setup, e.g. because there is a
      TopK and it's still a better plan to run with lower parallelism.
      The optimizer may also decide that it would rather have 3 (or N)
      partitions w/ best-effort sorting. In this case it discards the 2
      partition + sort order plan and tries to push down a new request for N
      partitions w/ the sort order.
So basically: the optimizer decides what sort of plan it wants, pushes
partitioning + sort order requirements down into the scan, and, depending on
what the scan can support, decides what the final plan should be.
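
To make the negotiation concrete, here is a rough, self-contained sketch of
the idea. None of this is DataFusion API: `FileRange`, `OutputOrdering`,
`arrange`, `Plan` and `choose_plan` are hypothetical names, and the per-file
min/max of the sort key is assumed to be available from file statistics. The
scan side greedily packs files into at most `k` groups and reports whether the
result is `Exact` or `Inexact`; the optimizer side reacts to that answer.

```rust
/// Hypothetical per-file statistics: min and max of the sort key.
#[derive(Debug, Clone, Copy, PartialEq)]
struct FileRange { min: i64, max: i64 }

/// Hypothetical stand-in for what the scan reports about its output ordering.
#[derive(Debug, Clone, Copy, PartialEq)]
enum OutputOrdering { Exact, Inexact }

/// Scan side: arrange files into at most `k` groups ordered by the sort key.
/// Returns `Exact` only if no group contains overlapping files.
fn arrange(files: &[FileRange], k: usize) -> (Vec<Vec<FileRange>>, OutputOrdering) {
    let k = k.max(1); // always allow at least one group
    let mut sorted = files.to_vec();
    sorted.sort_by_key(|f| (f.min, f.max));
    let mut groups: Vec<Vec<FileRange>> = Vec::new();
    let mut ordering = OutputOrdering::Exact;
    for f in sorted {
        // First fit: a group whose last file ends before this file starts.
        let fit = groups
            .iter()
            .position(|g| g.last().map_or(true, |last| last.max <= f.min));
        match fit {
            Some(i) => groups[i].push(f),
            None if groups.len() < k => groups.push(vec![f]),
            None => {
                // Out of groups: best effort, append to the group that ends
                // earliest and downgrade the reported ordering.
                let i = (0..groups.len())
                    .min_by_key(|&i| groups[i].last().unwrap().max)
                    .unwrap();
                groups[i].push(f);
                ordering = OutputOrdering::Inexact;
            }
        }
    }
    (groups, ordering)
}

/// Hypothetical plan shapes the optimizer can end up with.
#[derive(Debug, PartialEq)]
enum Plan {
    /// Groups are fully ordered: ProgressiveEval type plan.
    Progressive { partitions: usize },
    /// Best-effort ordering kept at low parallelism because of a TopK.
    TopKLowParallelism { partitions: usize },
    /// Best-effort ordering at full parallelism with a regular Sort.
    SortFullParallelism { partitions: usize },
}

/// Optimizer side: ask for `preferred` partitions first, fall back to `n`.
fn choose_plan(files: &[FileRange], preferred: usize, n: usize, has_topk: bool) -> Plan {
    let (groups, ordering) = arrange(files, preferred);
    match ordering {
        OutputOrdering::Exact => Plan::Progressive { partitions: groups.len() },
        OutputOrdering::Inexact if has_topk => {
            Plan::TopKLowParallelism { partitions: groups.len() }
        }
        OutputOrdering::Inexact => {
            // Discard the first attempt and push down a new request.
            let (groups, _) = arrange(files, n);
            Plan::SortFullParallelism { partitions: groups.len() }
        }
    }
}
```

The fallback policy in `choose_plan` (keep low parallelism only when there is
a TopK) is just one possible heuristic; the open question of which heuristic
to use for heavily overlapping files is exactly what the sketch leaves out.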