[GitHub] [druid] paul-rogers commented on issue #12746: If SQL Input Source can support Hive, Iceberg, and Presto JDBC, that would be super awesome and I will tell you why.

GitBox Thu, 14 Jul 2022 09:42:20 -0700


paul-rogers commented on issue #12746:
URL: https://github.com/apache/druid/issues/12746#issuecomment-1184661199


   For a bit of color, Apache Drill and Presto/Trino both support extensible 
input sources, as, of course, does Spark. To support them well, the tool has to 
have optimizer support to push as much work to the input source as possible. 
That means Parquet row group pruning, Iceberg versioning, pushing WHERE and join 
clauses into the JDBC data source, etc. These tools have, over time, built a 
bundle of tricks that can be used to work out what can be pushed, and how to do 
that for each source. This work is never done: there are always more tweaks.
   
   For Druid, we'd want a connector that can not only do the mechanics of 
sending requests and reading data, but also provide metadata that says whether 
the work can be distributed, or single-threaded, what can be pushed down, etc.
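
   To make the metadata idea concrete, here is a rough sketch of what a 
connector might report. Everything in it (`ConnectorCapabilities`, `Pushdown`, 
`plan_parallelism`) is a made-up name for illustration, not an existing Druid 
interface.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Pushdown(Enum):
    """Operations a source might accept (hypothetical list)."""
    FILTER = auto()
    PROJECTION = auto()
    JOIN = auto()


@dataclass(frozen=True)
class ConnectorCapabilities:
    """Metadata a connector could expose to the planner (hypothetical)."""
    splittable: bool        # can the read be divided among workers?
    max_parallelism: int    # upper bound on concurrent readers
    pushdowns: frozenset    # set of Pushdown values the source accepts

    def can_push(self, op: Pushdown) -> bool:
        return op in self.pushdowns


def plan_parallelism(caps: ConnectorCapabilities, requested: int) -> int:
    """Pick a reader count: 1 if the source cannot be split, else a capped value."""
    if not caps.splittable:
        return 1
    return max(1, min(requested, caps.max_parallelism))
```

   A single-connection JDBC source would report `splittable=False` and be read 
by one thread, while a source that can be split would let the planner fan out 
up to `max_parallelism` readers.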
   
   This can get fancy: we could partition SQL queries and push them to multiple 
Presto servers. For example, rather than doing `SELECT * FROM sales WHERE 
saleDate > ?`, split it into ten queries, each reading a subset of data. This 
is called "sharding" in the DB world: [Vitess](https://vitess.io/) does this 
for queries against sharded MySQL databases at YouTube.
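
   A minimal sketch of that split, assuming the range has a known upper bound 
(the example query above is open-ended, so the bound is an assumption) and that 
`saleDate` is a date column. The function names are hypothetical.

```python
from datetime import date, timedelta


def shard_date_range(start: date, end: date, shards: int):
    """Split the half-open range [start, end) into contiguous sub-ranges.

    Returns at most `shards` ranges; fewer if the range has fewer days.
    """
    total_days = (end - start).days
    if total_days <= 0 or shards <= 0:
        raise ValueError("need a non-empty range and at least one shard")
    shards = min(shards, total_days)
    base, extra = divmod(total_days, shards)
    ranges = []
    lo = start
    for i in range(shards):
        # The first `extra` shards take one additional day each.
        hi = lo + timedelta(days=base + (1 if i < extra else 0))
        ranges.append((lo, hi))
        lo = hi
    return ranges


def shard_queries(start: date, end: date, shards: int):
    """Pair one parameterized query with the bounds for each shard."""
    sql = "SELECT * FROM sales WHERE saleDate >= ? AND saleDate < ?"
    return [(sql, bounds) for bounds in shard_date_range(start, end, shards)]
```

   Half-open ranges keep the shards from overlapping, so no row is read twice. 
Each `(sql, bounds)` pair could then go to a different server.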
   
   One then gets into failure handling: what happens if one of those queries 
fails? Do we fail the entire insert job in Druid? Can we retry that one item? 
The connector would have to give us that information.
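
   One way to picture it: the connector classifies each failure as retryable 
or not, and the job retries only the affected piece. The sketch below is 
purely illustrative; `RetryableError`, `FatalError`, and `run_pieces` are 
invented names, and the real policy would be a design decision.

```python
class RetryableError(Exception):
    """Raised by a (hypothetical) connector for transient failures."""


class FatalError(Exception):
    """Raised when a piece cannot succeed, so the whole job fails."""


def run_pieces(pieces, run_one, max_attempts=3):
    """Run each piece, retrying only the piece that failed.

    `run_one(piece)` is a caller-supplied function that reads one piece.
    Any FatalError it raises propagates and fails the job.
    """
    results = {}
    for piece in pieces:
        for attempt in range(1, max_attempts + 1):
            try:
                results[piece] = run_one(piece)
                break
            except RetryableError:
                if attempt == max_attempts:
                    raise FatalError(
                        f"piece {piece!r} failed after {max_attempts} attempts"
                    )
    return results
```

   Retrying one piece is only safe if re-reading it cannot duplicate rows, 
which is again something the connector would have to tell us.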
   
   Would be great to understand the use case here. For exploratory use, SQL 
might be pretty cool: just grab some data from somewhere, load it in Druid, and 
try out ideas for a new app. Once the app goes into production, however, it 
would seem more stable to use some tool to pump data into Kafka, then have 
Druid read from there. It decouples the two operations, which would allow for a 
simpler system.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

