>
>
> > > - By "pipeline-breaking" I assume you mean "very slow", but can you
> > > give me details? Does this arise from some particular observation,
> > > or other reported issues?
> >
> > In general pipeline breaking means that the output of the operator
> > can't be produced until it has seen *ALL* its input.
> >
> > For example, a sort (ORDER BY x) is a pipeline breaker because the engine
> > has to see the entire input prior to being able to produce any output.
> >
> > However, a filter (WHERE x > 500) is not a pipeline breaker because the
> > operator can produce output rows as soon as it sees any that pass the
> > filter criteria.
> >
>
> Aha, I get it--so the goal is not necessarily to speed up the whole thing
> but to be able to send output to the next processing stage sooner.
> So IIRC besides sorts, the other types of queries mentioned were joins,
> group by, and hash aggregates?
>

Yes. DataFusion uses hash aggregates today to implement GROUP BY so I
probably wouldn't describe them as different types of queries.
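
To make the sort-versus-filter distinction quoted above concrete, here is a
minimal sketch in plain Python. It is not DataFusion code; the operator
functions and the helper rows_read_before_first_output are hypothetical names
made up for illustration. It counts how many input rows each operator has to
read before it can emit its first output row.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def filter_gt(rows: Iterable[int], threshold: int) -> Iterator[int]:
    """Streaming operator: emits each passing row as soon as it is seen."""
    for r in rows:
        if r > threshold:
            yield r


def sort_rows(rows: Iterable[int]) -> Iterator[int]:
    """Pipeline breaker: must buffer all input before emitting anything."""
    buffered = sorted(rows)  # consumes the entire input here
    yield from buffered


def hash_group_count(rows: Iterable[int]) -> Iterator[Tuple[int, int]]:
    """Pipeline breaker: a group's count is final only after all input."""
    counts = {}
    for r in rows:
        counts[r] = counts.get(r, 0) + 1
    yield from counts.items()


def rows_read_before_first_output(
    make_operator: Callable[[Iterator[int]], Iterator], data: List[int]
):
    """Return (first output row, number of input rows read to produce it)."""
    read = 0

    def counting_scan() -> Iterator[int]:
        nonlocal read
        for r in data:
            read += 1
            yield r

    first = next(make_operator(counting_scan()))
    return first, read


if __name__ == "__main__":
    data = [700, 100, 900, 600]
    print(rows_read_before_first_output(lambda src: filter_gt(src, 500), data))
    print(rows_read_before_first_output(sort_rows, data))
    print(rows_read_before_first_output(hash_group_count, data))
```

In this toy model the filter produces its first row after reading one input
row, while the sort and the hash aggregate read all of their input first.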


>
> >
> > > - In general, what tools are you using to analyze datafusion
> > > performance?
> >
> > The tools used most commonly are in the benchmark directory [1]. There is
> > some other work
> >
> > >  - How much profiling have you done to identify bottlenecks?
> >
> > I would say it is done on an "as needed" basis -- namely someone runs a
> > query that is important to them and then improves whatever hotspot they
> > may find.
> >
> > However, we don't have regular runs of the same queries, nor do we
> > automatically gather data over time. dianaclarke added integration for
> > condabench in [2] that I think would allow for such data collection, but
> > no one has hooked up the benchmarks to it yet.
> >
> > Getting regular runs of the performance benchmark up and running would be
> > very valuable indeed, if you were looking to help.
> >
> Yes, I'm definitely looking to help, and maybe getting more perf
> benchmarks up would be a good way of starting.
> I noticed that matthewmturner was working on something to run benchmarks in
> docker, which is pretty nice! [3]
> Any suggestions for performance use cases would be welcome; I could add
> them in.
> One thing I like to do is to run the same benchmark and tweak the knobs,
> such as number of rows, cardinality, etc., because the effects can vary
> A LOT.
>

I agree this is a great strategy.  I don't think there is any reason we
can't do this to DataFusion, but no one has yet invested the time into a
systematic investigation.
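
As a rough illustration of that kind of knob-tweaking, here is a minimal
sketch of a parameter sweep in plain Python. It does not touch DataFusion or
the benchmark directory; make_keys, group_count and sweep are hypothetical
names, and the in-memory hash count only stands in for a real GROUP BY query.
The point is the shape of the harness: vary row count and key cardinality
independently and record one timing per combination.

```python
import random
import time
from collections import Counter


def make_keys(n_rows, cardinality, seed=0):
    """Generate n_rows integer keys drawn uniformly from [0, cardinality)."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    return [rng.randrange(cardinality) for _ in range(n_rows)]


def group_count(keys):
    """Stand-in for a GROUP BY key, COUNT(*) query: a hash-based count."""
    return Counter(keys)


def sweep(row_counts, cardinalities, repeats=3):
    """Time group_count for every (rows, cardinality) pair.

    Records the best of `repeats` runs (repeats must be at least 1) to
    reduce noise from other activity on the machine.
    """
    results = []
    for n_rows in row_counts:
        for cardinality in cardinalities:
            keys = make_keys(n_rows, cardinality)
            best = float("inf")
            for _ in range(repeats):
                start = time.perf_counter()
                groups = group_count(keys)
                best = min(best, time.perf_counter() - start)
            results.append(
                {
                    "rows": n_rows,
                    "cardinality": cardinality,
                    "groups": len(groups),
                    "seconds": best,
                }
            )
    return results


if __name__ == "__main__":
    for row in sweep([10_000, 100_000], [10, 1_000, 100_000]):
        print(row)
```

Swapping group_count for a call that runs an actual query would turn this
into a real sweep; the timings from the toy version say nothing about
DataFusion itself.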

> I am tempted to venture opinions on how to do things based on my experience
> building my own (closed-source) columnar query engine, but that one is an
> entirely different beast, so I am not qualified to opine until I learn
> more.
> I'm starting to follow history about various performance improvements, but
> if anyone has any suggestion, like "I wish datafusion could complete X
> query on 50 bazillion rows in less than 3 days", let me know. In
> performance, there are so many variables that it's hard to know where to
> start.
>
>
I think the TPC-H queries would be a great place to start. Specifically,
getting to the point where we had a baseline number at Scale Factor 1 would
be great. DataFusion doesn't run some of them due to lack of various
features, but I can't remember how well those gaps are tracked. Getting DF
to the point where all the queries complete in a "reasonable" amount of
time would be pretty awesome.

Andrew
