Lordworms commented on issue #9964:
URL: https://github.com/apache/arrow-datafusion/issues/9964#issuecomment-2043673341

> > And then measure how much time is spent:
>
> that is very interesting
>
> > just want to know what is a good start to solving this issue, should I implement the cache https://github.com/apache/arrow-datafusion/blob/2b0a7db0ce64950864e07edaddfa80756fe0ffd5/datafusion/execution/src/cache/mod.rs here first?
>
> If indeed most of the execution time is spent parsing (or fetching) parquet metadata, implementing a basic cache would likely help.
>
> Also, @tustvold brought https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/parquet/trait.ParquetFileReaderFactory.html to my attention, which might be able to help avoid the overhead
>
> So what I suggest is:
>
> 1. Do a proof of concept (POC - hack it in, don't worry about tests, etc) with your approach and see if you can show performance improvements ([WIP: Avoid copying LogicalPlans / Exprs during OptimizerPasses #9708](https://github.com/apache/arrow-datafusion/pull/9708) is an example of such a PR)
> 2. If you can show it improves performance significantly, then we can work on a final design / tests / etc
>
> The reason to do the POC first is that performance analysis is notoriously tricky at the system level, so you want to have evidence your work will actually improve performance before you spend a bunch of time polishing up the PR (it is very demotivating, at least to me, to make a beautiful PR only to find out it doesn't really help performance)

Got it

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
