alamb commented on issue #9964:
URL: 
https://github.com/apache/arrow-datafusion/issues/9964#issuecomment-2043463422

   > And then measure how much time is spent:
   
   That is very interesting.
   
   > just want to know what is a good start to solving this issue, should I 
implement the cache 
https://github.com/apache/arrow-datafusion/blob/2b0a7db0ce64950864e07edaddfa80756fe0ffd5/datafusion/execution/src/cache/mod.rs
 here first?
   
   If most of the execution time is indeed spent parsing (or fetching) Parquet 
metadata, implementing a basic cache would likely help.
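
   A minimal sketch of what a "basic cache" could look like for a POC, using 
only the standard library. `FileMetadata` and `MetadataCache` are hypothetical 
names for illustration, not DataFusion or parquet types, and the `load` closure 
stands in for whatever fetches and parses the footer:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for parsed Parquet footer metadata.
#[derive(Debug, PartialEq)]
pub struct FileMetadata {
    pub num_rows: u64,
}

/// Hypothetical cache keyed by file path (not a DataFusion type).
#[derive(Default)]
pub struct MetadataCache {
    entries: Mutex<HashMap<String, Arc<FileMetadata>>>,
}

impl MetadataCache {
    /// Returns the cached metadata for `path`, calling `load` only on a miss.
    ///
    /// The lock is held while `load` runs, which keeps the sketch simple but
    /// serializes concurrent loads; a real design would likely avoid that.
    pub fn get_or_load<F>(&self, path: &str, load: F) -> Arc<FileMetadata>
    where
        F: FnOnce() -> FileMetadata,
    {
        let mut entries = self.entries.lock().unwrap();
        if let Some(hit) = entries.get(path) {
            // Cache hit: hand out another reference to the same metadata.
            return Arc::clone(hit);
        }
        // Cache miss: pay the parsing cost once and remember the result.
        let loaded = Arc::new(load());
        entries.insert(path.to_string(), Arc::clone(&loaded));
        loaded
    }
}
```

   Calling `get_or_load` twice with the same path runs the closure only the 
first time. The sketch has no eviction or invalidation, so it ignores files 
that change on disk.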
   
   Also, @tustvold brought 
https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/parquet/trait.ParquetFileReaderFactory.html
 to my attention, which might help avoid the overhead.
   
   So what I suggest is:
   1. Do a proof of concept (POC - hack it in, don't worry about tests, etc) 
with your approach and see if you can show performance improvements 
(https://github.com/apache/arrow-datafusion/pull/9708 is an example of such a 
PR)
   2. If you can show it improves performance significantly, then we can work 
on a final design / tests / etc
   
   The reason to do the POC first is that performance analysis is notoriously 
tricky at the system level, so you want evidence that your work will actually 
improve performance before you spend a bunch of time polishing up the PR (it 
is very demotivating, at least to me, to make a beautiful PR only to find out 
it doesn't really help performance)
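
   One rough way to gather that evidence is to time the same workload before 
and after the change. The helper below is a hypothetical sketch (`time_runs` 
is not part of DataFusion), and wall-clock timing like this is noisy, so treat 
the numbers as indicative only:

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper: runs `work` `iterations` times and returns the
/// total wall-clock time spent across all runs.
pub fn time_runs<F: FnMut()>(iterations: u32, mut work: F) -> Duration {
    let start = Instant::now();
    for _ in 0..iterations {
        work();
    }
    start.elapsed()
}
```

   For example, call `time_runs(100, || run_query())` once on the baseline and 
once on the POC branch, where `run_query` is a placeholder for the query under 
test, and compare the two durations.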


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

