alamb commented on issue #7698: URL: https://github.com/apache/arrow-datafusion/issues/7698#issuecomment-1808638503
> For instance, in our case the schema changes much more rarely than the data, and we can cache it for a long period of time

We do something similar to this in IOx (cache schemas that we know don't change rather than recomputing them).

It is my opinion that in order to make DFSchema behave well and not be a bottleneck, we will need to restructure how it works more fundamentally. Right now the amount of copying required is substantial, as has been pointed out several times on this thread. I think with sufficient diligence we could avoid almost all copies when manipulating DFSchema, and then the extra complexity of adding a cache or other techniques would become unnecessary.

> I've additionally changed the type of the precomputed qualified_name from String to Arc<String> after the above tests. Total planning time reduced to 75% of the previous iteration. But I think it is still far from optimal.

I think this is a great idea. Optimizing for the case of the same, reused qualifier is a very good idea.

What do people think about the approach described in https://github.com/apache/arrow-datafusion/pull/7944? I (admittedly biasedly) think that approach would eliminate almost all allocations (instead they would be ref count updates). We can extend it to incorporate ideas like pre-caching qualified names and hash sets for column checks, and I think it could be pretty fast.
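To make the ref-counting idea concrete, here is a minimal, hypothetical sketch (the `QualifiedField` type and its fields are invented for illustration, not the actual DFSchema/DFField code) of how sharing the qualifier and a precomputed qualified name via `Arc<str>` turns field clones into reference-count bumps instead of string allocations:

```rust
use std::sync::Arc;

/// Hypothetical, simplified stand-in for a field with a table qualifier.
#[derive(Clone, Debug)]
struct QualifiedField {
    /// Shared qualifier: cloning the field only bumps a reference count
    /// instead of copying the string.
    qualifier: Option<Arc<str>>,
    name: Arc<str>,
    /// Precomputed "qualifier.name", computed once and then shared by
    /// every clone made during plan rewrites.
    qualified_name: Arc<str>,
}

impl QualifiedField {
    fn new(qualifier: Option<&str>, name: &str) -> Self {
        let qualifier: Option<Arc<str>> = qualifier.map(Arc::from);
        let name: Arc<str> = Arc::from(name);
        let qualified_name: Arc<str> = match &qualifier {
            Some(q) => Arc::from(format!("{q}.{name}")),
            None => Arc::clone(&name),
        };
        Self { qualifier, name, qualified_name }
    }
}

fn main() {
    let field = QualifiedField::new(Some("my_table"), "col_a");

    // Cloning no longer allocates new Strings: each clone shares the
    // same underlying buffers with the original.
    let copy = field.clone();
    assert!(Arc::ptr_eq(&field.qualified_name, &copy.qualified_name));
    println!("{}", copy.qualified_name);
}
```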
