Lordworms commented on issue #9964:
URL: 
https://github.com/apache/arrow-datafusion/issues/9964#issuecomment-2041267489

   I have done some basic experiments with the bitcoin dataset
   
![17347f2f94015d8396ec20a0817a6f09](https://github.com/apache/arrow-datafusion/assets/48054792/84beed34-f1f9-4f3f-b485-c7a312a9778f)
   and also did some profiling with Instruments
   
   
   > FYI I think this is more like an Epic that can be used to coordinate 
individual tasks / changes rather than a specific change itself.
   > 
   > > Interested in this one
   > 
   > Thanks @Lordworms -- one thing that would probably help to start this 
project along would be to gather some data.
   > 
   > Specifically, run the ListingTable against data on a remote object store 
(e.g. figure out how to write a query against 100 parquet files on an S3 bucket).
   > 
   > And then measure how much time is spent:
   > 
   > 1. object store listing
   <img width="1202" alt="image" 
src="https://github.com/apache/arrow-datafusion/assets/48054792/0efbf812-3ed5-4603-8ee5-9fbe5a1b365b">
   > 2. fetching metadata
   <img width="1175" alt="image" 
src="https://github.com/apache/arrow-datafusion/assets/48054792/2fca61e9-05c7-4366-8b6f-a72c8d80f6dc">
   
   > 3. pruning / fetching IO
   > 4. How many object store requests are made
   > 
   > Does anyone know a good public data set on S3 that we could use to test / 
benchmark with?
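
The measurements listed in the quoted comment (time spent listing, time spent fetching metadata, and the number of object store requests) could be collected with a thin counting wrapper around the store. The following is only a minimal sketch of that idea in Python, not DataFusion code: `CountingStore`, `InMemoryStore`, and `scan` are hypothetical names invented for illustration, and the in-memory store merely stands in for a remote bucket.

```python
import time
from collections import defaultdict


class CountingStore:
    """Hypothetical wrapper: records call counts and elapsed time per operation."""

    def __init__(self, inner):
        self.inner = inner
        self.calls = defaultdict(int)
        self.seconds = defaultdict(float)

    def _timed(self, name, *args):
        # Forward the call to the wrapped store and record how long it took.
        start = time.perf_counter()
        try:
            return getattr(self.inner, name)(*args)
        finally:
            self.calls[name] += 1
            self.seconds[name] += time.perf_counter() - start

    def list(self, prefix):
        return self._timed("list", prefix)

    def head(self, path):
        return self._timed("head", path)

    def get(self, path):
        return self._timed("get", path)


class InMemoryStore:
    """Hypothetical stand-in for a remote store, used only to exercise the wrapper."""

    def __init__(self, objects):
        self.objects = objects

    def list(self, prefix):
        return sorted(p for p in self.objects if p.startswith(prefix))

    def head(self, path):
        # Stands in for a metadata request: returns the object size.
        return len(self.objects[path])

    def get(self, path):
        return self.objects[path]


def scan(store, prefix):
    """Lists a prefix, then fetches the size and bytes of every object found."""
    paths = store.list(prefix)
    return {p: (store.head(p), store.get(p)) for p in paths}
```

After a call such as `scan(store, "data/")`, the `store.calls` and `store.seconds` dictionaries give the per-operation request counts and cumulative time, which is the breakdown the list above asks for.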
   
   I just want to know what a good first step toward solving this issue would be. 
Should I implement the cache here first?
https://github.com/apache/arrow-datafusion/blob/2b0a7db0ce64950864e07edaddfa80756fe0ffd5/datafusion/execution/src/cache/mod.rs
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
