Lordworms commented on issue #9964: URL: https://github.com/apache/arrow-datafusion/issues/9964#issuecomment-2041267489
I have experimented a bit with the bitcoin dataset and also did some profiling with instrument

> FYI I think this is more like an Epic that can be used to coordinate individual tasks / changes rather than a specific change itself.
>
> > Interested in this one
>
> Thanks @Lordworms -- one thing that would probably help to start this project along would be to gather some data.
>
> Specifically, put the ListingTable against data on a remote object store (e.g. figure out how to write a query against 100 parquet files on an S3 bucket).
>
> And then measure how much time is spent:
>
> 1. object store listing <img width="1202" alt="image" src="https://github.com/apache/arrow-datafusion/assets/48054792/0efbf812-3ed5-4603-8ee5-9fbe5a1b365b">
> 2. fetching metadata <img width="1175" alt="image" src="https://github.com/apache/arrow-datafusion/assets/48054792/2fca61e9-05c7-4366-8b6f-a72c8d80f6dc">
> 3. pruning / fetching IO
> 4. How many object store requests are made
>
> Does anyone know a good public data set on S3 that we could use to test / benchmark with?

I just want to know what a good starting point for solving this issue would be. Should I implement the cache https://github.com/apache/arrow-datafusion/blob/2b0a7db0ce64950864e07edaddfa80756fe0ffd5/datafusion/execution/src/cache/mod.rs here first?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
