Hi,

I gave a talk at a meetup last night on how we plan to use DataFusion and Arrow at RMS. I will send a link to the recording once I have it.

One great outcome of the talk was that it was a forcing function for me to build a couple of things:

1. A very simple PoC of a RESTful query service where we can submit SQL and receive a result set. I used Rocket and it only took 100 lines of code or so.

2. A benchmark of parallel query execution for aggregate queries, comparing performance with Apache Spark (with cached DataFrames). This is arguably an unfair comparison, but it is the closest I could get to a fair one right now. The results were pretty impressive, with DataFusion showing significantly better scalability with the number of cores (more than 10x faster than Spark with 32 cores, for example).

After doing this, I finally feel like I have a good grasp of threading in Rust and understand the API changes required to allow the relevant state to be shared between threads. I expect to have some PRs around this over the next week or two.

I also plan on updating the DataSource traits to be partition-aware (returning a Vec of scanners instead of a single scanner). I think query execution will still be single-threaded for 0.13.0, but I'd like to at least get the API updated for partitioning.

I'm not crazy about the current RecordBatchIterator trait and feel like we should maybe use a real Iterator<Item=RecordBatch>, but I'd prefer to get the current API thread-safe before we tackle that one.

It might be nice to include the RESTful service as an example in the Arrow repo. I will create an issue and PR for that.

Thanks,

Andy.