[I] [DISCUSSION] Make it easy and fast to query files on remote files (S3, iceberg, etc) [datafusion]

via GitHub Sun, 17 Nov 2024 10:55:37 -0800


alamb opened a new issue, #13456:
URL: https://github.com/apache/datafusion/issues/13456


   ### Is your feature request related to a problem or challenge?
   
   I personally think making it easy to use DataFusion with the "open data 
lake" stack is very important over the next few months. 
   
   @julienledem wrote up a very nice piece describing [The advent of the Open 
Data 
Lake](https://sympathetic.ink/2024/11/07/The-Advent-Of-The-Open-Data-Lake.html)
   
   The high-level idea is to make it really easy for people to build systems 
that query (quickly!) Parquet files stored on remote object stores, 
including data managed by Apache Iceberg, Delta Lake, Hudi, etc.
   
   You *can* already use DataFusion (and `datafusion-cli`) to query such data, 
but it takes non-trivial effort to configure and tune for good performance. My 
idea is to make that easier, so DataFusion works better out of the box. 
   
   With that as a building block, people could/would build applications and 
systems targeting specific use cases.
   
   I don't yet fully understand where we currently stand on this goal, but I 
wanted to start the discussion.
   
   ### Describe the solution you'd like
   
   
   In my mind, the specific work this entails includes things like:
   
   - [ ]  Making it easier to use iceberg/delta/hudi with DataFusion
   - [ ] https://github.com/apache/datafusion/issues/12393
   - [ ] Make parquet reader in arrow-rs faster/better on remote object stores
   - [ ] Making it easier to cache parquet metadata
   
   ### Describe alternatives you've considered
   
   One specific item, brought up by @MrPowers, would be to try DataFusion with 
the "10B row challenge" described in 
https://dataengineeringcentral.substack.com/p/10-billion-row-challenge-duckdb-vs
 .
   
   I suspect the results would be non-ideal at first, but trying it to find out 
what the challenges are would help us focus our efforts.
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


