Some of you may have noticed a sudden flurry of activity from me after a bit of a break from the project, so I thought it might be useful to explain what I am up to.

As of 1.0.0, DataFusion isn't really useful against real-world datasets for a number of reasons, most of all the simplistic threading/partitioning model. There are a few small bugs as well.

My current focus is to be able to run TPC-H query 1 against decent-sized datasets (starting with the 100 GB dataset) with hundreds of partitions. I believe that I can get this working with some fairly small changes. Later, we can experiment with more advanced threading models and async, using the same benchmark to measure improvements.

Let me know if you have any questions.

Thanks,
Andy.