Some of you may have noticed a sudden flurry of activity from me after a
bit of a break from the project, so I thought it might be useful to explain
what I am up to.

As of 1.0.0, DataFusion isn't really useful against any real-world datasets
for a number of reasons, but most of all due to the simplistic
threading/partitioning model. There are a few small bugs as well.

My current focus is to be able to run TPC-H query 1 against decent-sized
datasets (starting with the 100 GB dataset) with hundreds of partitions. I
believe that I can get this working with some fairly small changes. Later,
we can experiment with more advanced threading models and async, using the
same benchmark to measure improvements.

Let me know if you have any questions.

Thanks,

Andy.
