gianm commented on issue #12746: URL: https://github.com/apache/druid/issues/12746#issuecomment-1175662006
I agree totally that it would be a cool feature. My 2¢ on implementation: the SQL input source is built to be super generic and pull from any database over JDBC. It's flexible, but I think performance and scalability would not be sufficient for pulling large amounts of data from a data lake. IMHO the best place to do Iceberg (or Delta Lake) integration would be in the InputSources. There are a couple of ways to do it:

1. Add a "file chooser" option to existing input sources like `s3`, `google`, `hdfs`, etc., and implement Iceberg and Delta Lake file choosers. The file lists would be filtered through these choosers, and then our regular code would take over.
2. Add `iceberg` and `deltaLake` input sources that accept data specs in whatever form is most natural for each system. Internally, those input sources would need some code for talking to S3, GCS, HDFS, or wherever else the data is stored; they'd likely do it through libraries provided by those other projects. This wouldn't compose with our existing `s3`, `google`, `hdfs`, etc., input sources.

To me, the first option is preferable, unless there is some downside that requires the second option. Experience with Hadoop suggests it's not a great idea to let the integration fully control communication with remote storage, and it's better to do composable approaches.
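To make the first option concrete, here is a minimal sketch of what a "file chooser" hook might look like. Everything here is hypothetical: `FileChooser` and `SuffixChooser` are illustrative names, not existing Druid interfaces, and a real Iceberg chooser would consult table metadata (the current snapshot's manifest) rather than filename patterns. The point is only that the chooser filters a file listing produced by the existing `s3`/`google`/`hdfs` input sources, which then read the surviving files as usual.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical extension point: given the files an input source listed,
// return the subset that actually belongs to the table being ingested.
interface FileChooser {
    List<String> choose(List<String> candidateFiles);
}

// Toy stand-in for an Iceberg/Delta chooser. A real one would read the
// table's metadata to pick data files from the current snapshot; here we
// just filter by suffix to show where the hook sits in the flow.
class SuffixChooser implements FileChooser {
    private final String suffix;

    SuffixChooser(String suffix) {
        this.suffix = suffix;
    }

    @Override
    public List<String> choose(List<String> candidateFiles) {
        return candidateFiles.stream()
                             .filter(f -> f.endsWith(suffix))
                             .collect(Collectors.toList());
    }
}

public class FileChooserSketch {
    public static void main(String[] args) {
        // Pretend the s3 input source listed these objects under the table path.
        List<String> listed = List.of(
                "s3://bucket/table/data-00000.parquet",
                "s3://bucket/table/data-00001.parquet",
                "s3://bucket/table/metadata/v2.metadata.json");

        FileChooser chooser = new SuffixChooser(".parquet");

        // Only the chosen files flow on to the regular splitting/reading code.
        System.out.println(chooser.choose(listed));
    }
}
```

The appeal of this shape is composability: the chooser owns table-format logic only, while credentials, retries, and byte-level reads stay in the existing storage-specific input sources.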
