paul-rogers commented on issue #12746: URL: https://github.com/apache/druid/issues/12746#issuecomment-1184661199
For a bit of color, Apache Drill and Presto/Trino both support extensible input sources, as, of course, does Spark. To support them well, the tool has to have optimizer support to push as much work to the input source as possible. That means Parquet row group pruning, Iceberg versioning, pushing WHERE and join clauses into the JDBC data source, etc. These tools have, over time, built a bundle of tricks that can be used to work out what can be pushed, and how to do that for each source. This work is never done: there are always more tweaks.

For Druid, we'd want a connector that can not only do the mechanics of sending requests and reading data, but also provide metadata that says whether the work can be distributed or must be single-threaded, what can be pushed down, etc.

This can get fancy: we could partition SQL queries and push them to multiple Presto servers. For example, rather than doing `SELECT * FROM sales WHERE saleDate > ?`, split it into ten queries, each reading a subset of the data. This is called "sharding" in the DB world: [Vitess](https://vitess.io/) does this for queries against sharded MySQL databases at YouTube.

One then gets into failure handling: what happens if one of those queries fails? Do we fail the entire insert job in Druid? Can we retry that one item? The connector would have to give us that information.

Would be great to understand the use case here. For exploratory use, SQL might be pretty cool: just grab some data from somewhere, load it in Druid, and try out ideas for a new app. Once the app goes into production, however, it would seem more stable to use some tool to pump data into Kafka, then have Druid read from there. That decouples the two operations, which would allow for a simpler system.
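
To make the splitting and per-shard retry ideas above a bit more concrete, here is a minimal Python sketch. It is only an illustration, not Druid or Presto code: the upper date bound, the `split_date_range` and `run_shards` helpers, and the caller-supplied `run_query` callable are all assumptions introduced for the example.

```python
from datetime import date, timedelta

# Hypothetical sharded form of the query from the comment: each shard reads
# the slice (lo, hi], so the slices are contiguous and do not overlap.
BASE_QUERY = "SELECT * FROM sales WHERE saleDate > ? AND saleDate <= ?"


def split_date_range(start, end, shards):
    """Split the range (start, end] into at most `shards` contiguous slices."""
    total_days = (end - start).days
    if total_days <= 0 or shards <= 0:
        return []
    shards = min(shards, total_days)  # never create an empty slice
    base, extra = divmod(total_days, shards)
    ranges = []
    lo = start
    for i in range(shards):
        # The first `extra` slices get one additional day each.
        hi = lo + timedelta(days=base + (1 if i < extra else 0))
        ranges.append((lo, hi))
        lo = hi
    return ranges


def run_shards(ranges, run_query, max_attempts=3):
    """Run one query per slice, retrying a failed slice on its own.

    `run_query(sql, params)` is a caller-supplied callable (an assumption of
    this sketch). Returns (results, failed): results maps each slice to its
    rows, and failed lists the slices that never succeeded.
    """
    results, failed = {}, []
    for params in ranges:
        for attempt in range(max_attempts):
            try:
                results[params] = run_query(BASE_QUERY, params)
                break
            except Exception:  # broad on purpose: this is only a sketch
                if attempt == max_attempts - 1:
                    failed.append(params)
    return results, failed
```

Returning the failed slices, rather than raising on the first error, leaves it to the caller to decide whether to fail the whole insert job or retry only those slices, which is the kind of choice the connector metadata would need to inform.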
To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
