Re: Question about parallel query planning

JiaTao Tao Thu, 11 Mar 2021 18:45:56 -0800

You can move the union merge rule to the HepPlanner; that rule is what
provides the benefit Julian describes (flattening pair-wise unions).
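
To make the flattening idea concrete, here is a minimal toy sketch in
Python. It is not Calcite code and not Calcite's implementation of the
union merge rule; the names Scan, UnionAll, flatten_union, depth and
left_deep_union are hypothetical and exist only for this illustration.
It assumes UNION ALL semantics, builds a left-deep chain of pair-wise
unions, and collapses it into a single n-way union.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Scan:
    """Leaf node: stand-in for one simple scan subquery (hypothetical)."""
    table: str


@dataclass(frozen=True)
class UnionAll:
    """UNION ALL over any number of inputs (hypothetical toy node)."""
    inputs: Tuple["Node", ...]


Node = Union[Scan, UnionAll]


def flatten_union(node: Node) -> Node:
    """Collapse nested UnionAll nodes into one n-way UnionAll, bottom-up."""
    if isinstance(node, Scan):
        return node
    merged = []
    for child in node.inputs:
        child = flatten_union(child)
        if isinstance(child, UnionAll):
            # Splice the child's inputs directly into the parent union.
            merged.extend(child.inputs)
        else:
            merged.append(child)
    return UnionAll(tuple(merged))


def depth(node: Node) -> int:
    """Height of the plan tree, counting a single Scan as depth 1."""
    if isinstance(node, Scan):
        return 1
    return 1 + max(depth(c) for c in node.inputs)


def left_deep_union(tables):
    """Build ((t0 U t1) U t2) U ... as a chain of pair-wise unions."""
    plan = Scan(tables[0])
    for t in tables[1:]:
        plan = UnionAll((plan, Scan(t)))
    return plan


if __name__ == "__main__":
    plan = left_deep_union([f"t{i}" for i in range(121)])
    flat = flatten_union(plan)
    print(depth(plan), depth(flat), len(flat.inputs))
```

In this toy model, a chain of 120 pair-wise unions over 121 scans
becomes one union node with 121 inputs. A query that is already a flat
n-way union gains nothing from this step.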


Regards!

Aron Tao


Jihoon Son <[email protected]> wrote on Fri, Mar 12, 2021 at 6:36 AM:

> Julian, thank you for the pointers. I will look more closely at how we
> can make the shared data structures thread-safe.
>
> > Are these, by any chance, pair-wise unions that can be flattened to
> > n-way unions? That kind of transformation is almost always beneficial.
>
> Can you elaborate more on this idea? What do you mean by n-way unions?
> This particular query I'm looking at is a flat union query that has
> 121 simple scan queries. All these subqueries have a pattern of
> "SELECT 'string_literal' FROM table WHERE filter LIMIT 1". I think
> this union query can be rewritten to a similar query to avoid using
> UNIONs at all, such as a scan query using CASE WHEN or an aggregate
> query using aggregate functions with FILTER. However, execution time
> of the rewritten query would be slower than the original union query
> as the LIMIT clause cannot be pushed down to scan.
>
> > my advice is to move some rules to hep planner, like sub-query remove,
> > union merge, etc.
>
> Aron, thank you for the tip. The subquery remove rule is already
> processed by HepPlanner. For the union merge rule, we disabled it
> because of the performance issue with unions. Do you have any other
> suggestions for what rules to move?
>
> Thanks,
> Jihoon
>
> On Thu, Mar 11, 2021 at 1:11 PM Julian Hyde <[email protected]>
> wrote:
> >
> > Are these, by any chance, pair-wise unions that can be flattened to
> > n-way unions? That kind of transformation is almost always beneficial.
> >
> > Julian
> >
> > > On Mar 11, 2021, at 12:34 AM, JiaTao Tao <[email protected]> wrote:
> > >
> > > Hi Jihoon Son
> > > > I ran into the same problem (hundreds of unions), and my advice is to
> > > > move some rules to hep planner, like sub-query remove, union merge, etc.
> > > > And this works for me.
> > >
> > > Regards!
> > >
> > > Aron Tao
> > >
> > >
> > > > Julian Hyde <[email protected]> wrote on Wed, Mar 10, 2021 at 2:59 AM:
> > >
> > >> At a high level, the Volcano/Cascades planning algorithm is amenable
> > >> to parallelization. It uses a "work queue" (of matched rules that have
> > >> not been applied yet) and each task is additive (adds relational
> > >> expressions to the graph of relational expressions and their
> > >> equivalence sets, and things are immutable once added to the graph).
> > >>
> > >> The devil will be in the details: making sure that the shared data
> > >> structures work correctly when other threads are modifying them. For
> > >> example, what happens when I try to add a RelNode to a set that is
> > > >> currently being merged with another set?
> > >>
> > >> Other shared data structures include metadata (aka statistics) and
> > >> type factories. I think that their APIs are in fairly good shape for
> > >> making them parallel.
> > >>
> > >> Julian
> > >>
> > >>
> > > >>> On Tue, Mar 9, 2021 at 10:45 AM Jihoon Son <[email protected]> wrote:
> > >>>
> > >>> Hi Vladimir, thank you for your reply.
> > >>>
> > >>> 5 sec might not be bad from a technical point of view, but our user
> > >>> wants their queries to finish in 2 - 3 seconds including planning
> > >>> time. The actual query execution time for this particular query was 2
> > >>> seconds which can be improved to 20 ms in my testing. However, the
> > >>> planning time is the bottleneck and thus improving execution time did
> > >>> not help much in this case.
> > >>>
> > > >>>> Did you have a chance to check which exact rules contributed to the
> > > >>>> planning time? You may inject a listener to VolcanoPlanner to check
> > > >>>> that.
> > >>>
> > > >>> I didn't before, so I just looked at the code to learn how to inject a
> > > >>> listener to VolcanoPlanner. But I'm not sure how I can do it. We are
> > >>> creating a org.apache.calcite.prepare.PlannerImpl using
> > >>> org.apache.calcite.tools.Frameworks.getPlanner()
> > >>> (
> > >>
> https://github.com/apache/druid/blob/master/sql/src/main/java/org/apache/druid/sql/calcite/planner/DruidPlanner.java#L89
> > >> ).
> > > >>> This PlannerImpl has VolcanoPlanner in it, but neither exposes it to
> > > >>> the outside nor provides an interface for adding a listener. I guess I can
> > >>> add an interface in PlannerImpl (and Planner) and make a custom build
> > >>> of Calcite. But I'm wondering if there is a way that I can inject a
> > >>> listener without making a custom build.
> > >>>
> > >>> Jihoon
> > >>>
> > > >>> On Tue, Mar 9, 2021 at 12:03 AM Vladimir Ozerov <[email protected]> wrote:
> > >>>>
> > >>>> *at such = at such scale
> > >>>>
> > > >>>> Tue, Mar 9, 2021 at 11:01, Vladimir Ozerov <[email protected]>:
> > >>>>
> > >>>>> Hi Jihoon,
> > >>>>>
> > > >>>>> I would say that 5 sec could be actually a pretty good result at
> > > >>>>> such. Did you have a chance to check which exact rules contributed
> > > >>>>> to the planning time? You may inject a listener to VolcanoPlanner
> > > >>>>> to check that.
> > >>>>>
> > >>>>> Regards,
> > >>>>> Vladimir
> > >>>>>
> > > >>>>> Tue, Mar 9, 2021 at 05:37, Jihoon Son <[email protected]>:
> > >>>>>
> > >>>>>> Hi all,
> > >>>>>>
> > > >>>>>> I posted the same question on the ASF slack channel, but am posting
> > > >>>>>> here as well to get a quicker response.
> > > >>>>>>
> > > >>>>>> I'm seeing an issue in query planning where it takes a long time
> > > >>>>>> (5+ sec) for a giant union query that has 120 subqueries in it. I
> > > >>>>>> captured a flame graph (attached in this email) to see where the
> > > >>>>>> bottleneck is, and based on the flame graph, I believe the query
> > > >>>>>> planner spent most of its time exploring the search space of
> > > >>>>>> candidate plans to find the best plan. This seems to be because of
> > > >>>>>> the many subqueries in the same union query. Is my understanding
> > > >>>>>> correct? If so, for this particular case, it seems possible to
> > > >>>>>> parallelize exploring the search space. Do you have any plan for
> > > >>>>>> parallelizing this part? I'm not sure whether it's already done in
> > > >>>>>> the master branch, though. I tried to search for a jira ticket on
> > > >>>>>> https://issues.apache.org/jira/browse/CALCITE, but couldn't find
> > > >>>>>> anything with my search skill.
> > >>>>>>
> > >>>>>> Thanks,
> > >>>>>> Jihoon
> > >>>>>>
> > >>>>>
> > >>
>
