gianm commented on issue #12746:
URL: https://github.com/apache/druid/issues/12746#issuecomment-1175662006

   I totally agree that it would be a cool feature. My 2¢ on implementation:
   
   The SQL input source is built to be super generic and pull from any database 
over JDBC. It's flexible, but I don't think its performance and scalability would 
be sufficient for pulling large amounts of data from a data lake. IMHO the best 
place to do Iceberg (or Delta Lake) integration would be in the InputSources. 
There are a couple of ways to do it:
   
   1. Add a "file chooser" option to existing input sources like `s3`, 
`google`, `hdfs`, etc. Implement Iceberg and Delta Lake file choosers. The file 
lists would be filtered through these choosers, and then our regular code would 
take over.
   2. Add `iceberg` and `deltaLake` input sources that accept data specs in 
whatever form is most natural for each system. Internally, those input sources 
would need code for talking to S3, GCS, HDFS, or wherever else the data is 
stored, likely through libraries provided by those other projects. This wouldn't 
compose with our existing `s3`, `google`, `hdfs`, etc. input sources.
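   To make the first option concrete, here is a minimal sketch of what a "file 
chooser" hook might look like. All names here (`FileChooser`, 
`IcebergFileChooser`, `applyChooser`) are hypothetical, not existing Druid or 
Iceberg APIs, and the set of live data files is supplied directly rather than 
read from real table metadata, to keep the sketch self-contained:

   ```java
   import java.net.URI;
   import java.util.List;
   import java.util.Set;
   import java.util.function.Predicate;
   import java.util.stream.Collectors;

   // Hypothetical "file chooser" hook: given the full file listing an input
   // source discovers (e.g. under an s3 prefix), keep only the files that the
   // table format's metadata says belong to the table.
   interface FileChooser extends Predicate<URI> {}

   class IcebergFileChooser implements FileChooser {
       private final Set<URI> liveDataFiles;

       // In a real integration this set would come from Iceberg table metadata
       // (the manifests for a chosen snapshot); here it is passed in directly.
       IcebergFileChooser(Set<URI> liveDataFiles) {
           this.liveDataFiles = liveDataFiles;
       }

       @Override
       public boolean test(URI file) {
           return liveDataFiles.contains(file);
       }
   }

   public class FileChooserSketch {
       // The existing input source would call something like this after listing
       // objects, then hand the surviving files to its regular split/read code.
       static List<URI> applyChooser(List<URI> listing, FileChooser chooser) {
           return listing.stream().filter(chooser).collect(Collectors.toList());
       }

       public static void main(String[] args) {
           List<URI> listing = List.of(
               URI.create("s3://bucket/table/data/part-00000.parquet"),
               URI.create("s3://bucket/table/data/part-00001.parquet"),
               URI.create("s3://bucket/table/data/orphan-file.parquet"));
           FileChooser chooser = new IcebergFileChooser(Set.of(
               URI.create("s3://bucket/table/data/part-00000.parquet"),
               URI.create("s3://bucket/table/data/part-00001.parquet")));
           // Prints the two live data files; the orphan file is dropped.
           System.out.println(applyChooser(listing, chooser));
       }
   }
   ```

   The point of this shape is that the chooser only decides *which* files to 
read; listing, authentication, and reading stay in the existing `s3`, `google`, 
`hdfs`, etc. input sources.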
   
   To me, the first option is preferable unless there is some downside that 
requires the second. Experience with Hadoop suggests it's not a great idea to 
let the integration fully control communication with remote storage; composable 
approaches work better.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

