Re: [Rust][DataFusion] Query Engine Design / DataFusion Implementation talk

Micah Kornfield Thu, 18 Mar 2021 08:51:43 -0700

For Java/JVM, there is also a discussion on user@ about dataframe libraries.


On Thu, Mar 18, 2021 at 5:47 AM Andrew Lamb <[email protected]> wrote:

> The system you describe sounds quite cool. I don't know what is going on in
> the Java world -- as you say, I think there is work afoot on technologies
> similar in use case to DataFusion in C++ (though I suspect the
> implementation will be fairly different).
>
>
>
> On Wed, Mar 17, 2021 at 5:37 PM bobtins <[email protected]> wrote:
>
> > I missed the talk but watched the video, which was fascinating. It helped
> > me get the whole picture of what DataFusion does, which is impressive. In
> > my previous job, I built a data analysis engine on a smaller scale in Java,
> > so some of the problems that DataFusion tackles are familiar to me.
> >
> > The initial implementation of my engine would load some data from a
> > relational DB into a columnar memory store that I implemented (very much
> > like Arrow); it would then perform various transformations analogous to the
> > logical plan in DataFusion (sort, group, filter, aggregate, etc.), while also
> > supporting OLAP-like multi-level hierarchies and cubes. This query model
> > didn't have a language itself; the UI manipulated an object model which
> > contained the logical plan (although unfortunately the query model was
> > tangled with other layers).
> >
> > This was later enhanced to generate SQL queries so you wouldn't have to
> > load everything into memory, but you could do in-memory operations on top
> > of the SQL result. I came up with an expression language close to SQL which
> > could be translated into in-memory or SQL operations. I had to do something
> > like the merge operator in DataFusion to support multi-stage aggregation
> > (e.g. implement count(x) -> sum(count(x)), average(x) ->
> > sum(sum(x))/sum(count(x)), etc.).
> >
> > Like I said, my framework was nowhere near as heavy-duty as DataFusion +
> > Arrow, but my familiarity with the power of in-memory columnar stores is
> > what drew me to Arrow in the first place.
> >
> > I am curious about how the various language implementations in Arrow are
> > evolving computation frameworks; for Rust, there is DataFusion, and I
> > noticed that there has been a lot of work going on in C++/Python. For Java,
> > it seems like this would be in the realm of Gandiva or the Dremio
> > product...and of course there's Spark! I am still surveying the terrain,
> > but any pointers to work people are doing in Java would be welcome.
> >
> > On 2021/03/12 19:39:16, Andrew Lamb <[email protected]> wrote:
> > > Here are links to the content, should anyone be interested:
> > >
> > > Query Engine Design and the Rust-Based DataFusion in Apache Arrow
> > > recording: https://www.youtube.com/watch?v=K6eCAVEk4kU
> > > slides (DataFusion content starts on slide 6):
> > > https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934
> > >
> > > On Thu, Mar 4, 2021 at 4:05 PM Andrew Lamb <[email protected]> wrote:
> > >
> > > > In case anyone is interested in the topic in general or DataFusion in
> > > > particular, I plan a tech talk [1] next week about "Query Engine Design and
> > > > the Rust based DataFusion in Apache Arrow."
> > > >
> > > > If you are curious how (SQL) query engines in general are structured, I
> > > > plan to describe the typical high level architecture, using DataFusion as
> > > > an exemplar.
> > > >
> > > > It will be held next Wednesday, March 10, 2021 at 8:00 am PST | 4:00 pm
> > > > GMT, and posted publicly afterwards.
> > > >
> > > > Andrew
> > > >
> > > > [1] https://www.influxdata.com/community-showcase/influxdb-tech-talks/
> > > >
> > > >
> > >
> >
>
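
The multi-stage aggregation mentioned in the quoted message above (count(x) -> sum(count(x)), average(x) -> sum(sum(x))/sum(count(x))) can be sketched roughly as follows. This is only a minimal illustration in plain Rust, not DataFusion's actual merge operator or API; the names Partial, partial_aggregate, merge, and average are hypothetical and chosen for this example.

```rust
/// Partial aggregate state computed independently for one partition.
/// (Hypothetical type for illustration; not a DataFusion type.)
#[derive(Debug, Clone, Copy, PartialEq)]
struct Partial {
    count: u64,
    sum: f64,
}

/// Stage 1: aggregate a single partition into a partial state.
fn partial_aggregate(values: &[f64]) -> Partial {
    Partial {
        count: values.len() as u64,
        sum: values.iter().sum(),
    }
}

/// Stage 2: merge partial states.
/// The final count is sum(count(x)); the final sum is sum(sum(x)).
fn merge(partials: &[Partial]) -> Partial {
    partials.iter().fold(
        Partial { count: 0, sum: 0.0 },
        |acc, p| Partial {
            count: acc.count + p.count,
            sum: acc.sum + p.sum,
        },
    )
}

/// average(x) = sum(sum(x)) / sum(count(x)); None when there are no rows.
fn average(p: &Partial) -> Option<f64> {
    if p.count == 0 {
        None
    } else {
        Some(p.sum / p.count as f64)
    }
}
```

Averaging the per-partition averages directly would give the wrong answer when partitions differ in size, which is why the partial state carries both the sum and the count and the division happens only after the merge.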
