Re: [I] [EPIC] Improve the performance of ListingTable [arrow-datafusion]

via GitHub Mon, 08 Apr 2024 14:29:09 -0700


Lordworms commented on issue #9964:
URL: https://github.com/apache/arrow-datafusion/issues/9964#issuecomment-2043673341


   > > And then measure how much time is spent:
   > 
   > that is very interesting
   > 
   > > just want to know what is a good start to solving this issue, should I 
implement the cache
   > 
   > just want to know what is a good start to solving this issue, should I 
implement the cache 
https://github.com/apache/arrow-datafusion/blob/2b0a7db0ce64950864e07edaddfa80756fe0ffd5/datafusion/execution/src/cache/mod.rs
 here first?
   > 
   > If indeed most of the execution time is spent parsing (or fetching) parquet 
metadata, implementing a basic cache would likely help.
   > 
   > Also, @tustvold brought 
https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/parquet/trait.ParquetFileReaderFactory.html
 to my attention which might be able to help avoid the overhead
   > 
   > So what I suggest is:
   > 
   > 1. Do a proof of concept (POC - hack it in, don't worry about tests, etc) 
with your approach and see if you can show performance improvements ([WIP: 
Avoid copying LogicalPlans / Exprs during OptimizerPasses 
#9708](https://github.com/apache/arrow-datafusion/pull/9708)  is an example of 
such a PR)
   > 2. If you can show it improves performance significantly, then we can work 
on a final design / tests / etc
   > 
   > The reason to do the POC first is that performance analysis is notoriously 
tricky at the system level so you want to have evidence your work will actually 
improve performance before you spend a bunch of time polishing up the PR (it is 
very demotivating, at least to me, to make a beautiful PR only to find out it 
doesn't really help performance)
   
   Got it
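
For the POC, a basic metadata cache could look roughly like the sketch below. This is only an illustration under stated assumptions: `FileMeta`, `MetaCache`, and `get_or_load` are hypothetical names and are not DataFusion's cache API or the `ParquetFileReaderFactory` trait. It shows the core idea, which is to key parsed metadata by file path so repeated scans of the same file skip the parse or fetch.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Stand-in for parsed Parquet footer metadata (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct FileMeta {
    pub num_rows: usize,
}

/// Hypothetical cache keyed by object path; not DataFusion's cache API.
#[derive(Default)]
pub struct MetaCache {
    entries: Mutex<HashMap<String, Arc<FileMeta>>>,
}

impl MetaCache {
    /// Return the cached entry for `path`, calling `load` only on a miss.
    /// The lock is held while `load` runs, which keeps the sketch simple
    /// but serializes concurrent loads.
    pub fn get_or_load<F>(&self, path: &str, load: F) -> Arc<FileMeta>
    where
        F: FnOnce() -> FileMeta,
    {
        let mut entries = self.entries.lock().unwrap();
        if let Some(hit) = entries.get(path) {
            return Arc::clone(hit);
        }
        let meta = Arc::new(load());
        entries.insert(path.to_string(), Arc::clone(&meta));
        meta
    }
}
```

A real version would also need invalidation (for example, comparing the file's size or last-modified time) and a size bound. For a POC whose only goal is to show whether caching moves the numbers, those can probably wait.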


