Hi,

I gave a talk at a meetup last night on how we are planning on using
DataFusion & Arrow at RMS and I will send a link to the recording once I
have it.

One great outcome of the talk was that it was a forcing function for me to
build a couple things:

1. I built a very simple PoC of a RESTful Query Service where we can submit
SQL and receive a result set. I used Rocket and it only took 100 lines of
code or so.

2. A benchmark of parallel query execution for aggregate queries comparing
performance with Apache Spark (with cached DataFrames), which is arguably
an unfair comparison, but the closest I could get to a fair comparison
right now. The results were pretty impressive, with DataFusion showing
significantly better scalability with number of cores (> 10x faster than
Spark with 32 cores for example).

After doing this, I finally feel like I have a good grasp on threading in
Rust, and understand the necessary changes that will be required to the
APIs to allow for relevant state to be shared between threads, and I expect
to have some PRs around this over the next week or two.

I also plan on updating the DataSource traits to be partition aware (return
a Vec of scanners instead of single scanner). I think query execution will
still be single-threaded for 0.13.0 but I'd like to at least get the API
updated for partitioning.

I'm not crazy about the current RecordBatchIterator trait and feel like we
should maybe use a real Iterator<Item=RecordBatch> but I'd prefer to get
the current API thread-safe before we tackle that one.

It might be nice to include the RESTful service as an example in the Arrow
repo. I will create an issue and PR for that.

Thanks,

Andy.

Reply via email to