Re: My focus for Rust implementation for 2.0.0

Andy Grove Fri, 14 Aug 2020 06:59:10 -0700

First, an update on progress. Once the PRs for ARROW-9711 and ARROW-9716
are merged, it will be possible to run TPC-H query 1 against a 100 GB data
set with performance similar to Apache Spark in local mode. I plan on
testing larger datasets over the weekend.

To answer Kirill's question, I wouldn't necessarily characterize it as
giving up on exploring any integration with Gandiva. There are several
integrations that I would be interested in exploring, including with the
Arrow C Data Interface and the C++ Dataset work that is happening. However,
I only have so much time available to contribute to this project, and I
have some specific goals that I am working towards that are a much higher
priority for me right now.

Also, I am encouraged by the performance I'm seeing from DataFusion after
some of the changes this week, and I know there is plenty of room for
improvement still. This perhaps makes it less compelling to explore
delegating to C++ at this point. However, it would be nice to see some
performance comparisons between DataFusion and the C++ Dataset work.

Thanks,

Andy.

On Fri, Aug 14, 2020 at 2:18 AM Kirill Lykov <lykov.kir...@gmail.com> wrote:

> Sounds interesting, as we wanted to start using DataFusion.
> Btw, I vaguely remember that in the original repository you had an issue
> like "investigate DataFusion with Gandiva". I'm curious why you have
> decided to give up on it?
>
> On Thu, Aug 13, 2020 at 5:11 PM Andy Grove <andygrov...@gmail.com> wrote:
> >
> > Some of you may have noticed a sudden flurry of activity from me after a
> > bit of a break from the project, so I thought it might be useful to
> > explain what I am up to.
> >
> > As of 1.0.0, DataFusion isn't really useful against any real-world data
> > sets for a number of reasons, but most of all due to the simplistic
> > threading/partitioning model. There are a few small bugs as well.
> >
> > My current focus is to be able to run TPC-H query 1 against decent size
> > datasets (starting with the 100 GB dataset) with hundreds of partitions. I
> > believe that I can get this working with some fairly small changes. Later,
> > we can experiment with more advanced threading models and async, using the
> > same benchmark to measure improvements.
> >
> > Let me know if you have any questions.
> >
> > Thanks,
> >
> > Andy.
>
>
>
> --
> Best regards,
> Kirill Lykov
>
