First, an update on progress. Once the PRs for ARROW-9711 and ARROW-9716 are merged, it will be possible to run TPC-H query 1 against a 100 GB dataset with similar performance to Apache Spark in local mode. I plan on testing larger datasets over the weekend.

To answer Kirill's question, I wouldn't necessarily characterize it as giving up on exploring any integration with Gandiva. There are several integrations that I would be interested in exploring, including with the Arrow C Data Interface and the C++ Dataset work that is happening, but I only have so much time available to contribute to this project, and I have some specific goals that I am working towards that are a much higher priority for me right now.

Also, I am encouraged by the performance I'm seeing from DataFusion after some of the changes this week, and I know there is plenty of room for improvement still. This perhaps makes it less compelling to explore delegating to C++ at this point. However, it would be nice to see some performance comparisons between DataFusion and the C++ Dataset work.

Thanks,

Andy.

On Fri, Aug 14, 2020 at 2:18 AM Kirill Lykov <lykov.kir...@gmail.com> wrote:

> Sounds interesting as we wanted to start using DataFusion.
> Btw, I vaguely remember that in the original repository you had issue
> like "investigate DataFusion with Gandiva", I'm curious why you have
> decided to give up with it?
>
> On Thu, Aug 13, 2020 at 5:11 PM Andy Grove <andygrov...@gmail.com> wrote:
> >
> > Some of you may have noticed a sudden flurry of activity from me after a
> > bit of a break from the project, so I thought it might be useful to
> > explain what I am up to.
> >
> > As of 1.0.0, DataFusion isn't really useful against any real-world data
> > sets for a number of reasons, but most of all due to the simplistic
> > threading/partitioning model. There are a few small bugs as well.
> >
> > My current focus is to be able to run TPC-H query 1 against decent size
> > datasets (starting with the 100 GB dataset) with hundreds of partitions.
> > I believe that I can get this working with some fairly small changes.
> > Later, we can experiment with more advanced threading models and async,
> > using the same benchmark to measure improvements.
> >
> > Let me know if you have any questions.
> >
> > Thanks,
> >
> > Andy.
> >
>
> --
> Best regards,
> Kirill Lykov