Re: [I] Bad performance on wide tables (1000+ columns) [arrow-datafusion]

via GitHub Thu, 26 Oct 2023 12:40:12 -0700


alamb commented on issue #7698:
URL: https://github.com/apache/arrow-datafusion/issues/7698#issuecomment-1781787244


   I have reviewed https://github.com/apache/arrow-datafusion/pull/7870 and 
https://github.com/apache/arrow-datafusion/pull/7878
   
   Here are my thoughts:
   1. I think we need some performance benchmark results to know how much this 
helps, and how much it hurts in other areas (like how much longer it takes to 
create a schema). Can someone please create some benchmarks, similar to 
[scalar.rs](https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/benches/scalar.rs),
 for `index_of_column_by_name` and schema creation?
   2. I think it is likely too expensive to build a `HashMap` with each 
DFSchema (as doing so creates / copies owned strings) if the map is never read -- I 
think it should be built on demand, as suggested by @crepererum at 
https://github.com/apache/arrow-datafusion/pull/7870/files#r1372786446
   3. I have long been bothered by how expensive it is to create a DFSchema. I 
have some ideas on how to make it faster to construct -- this might not help 
this use case directly, but I think it might help planning in general. I will take a 
crack at working on this idea.
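
   To make the "built on demand" idea in point 2 concrete, here is a rough 
sketch using only the standard library. `WideSchema`, `index_of`, and 
`index_is_built` are hypothetical names for illustration, not the actual 
`DFSchema` API or the code in either PR. It assumes `std::sync::OnceLock` 
(Rust 1.70+) is acceptable for the lazy initialization.

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

/// Hypothetical stand-in for a schema; NOT DataFusion's real `DFSchema`.
struct WideSchema {
    field_names: Vec<String>,
    /// Name -> index map. Left empty at construction and filled on first lookup.
    name_index: OnceLock<HashMap<String, usize>>,
}

impl WideSchema {
    /// Construction does not hash anything or copy any field names.
    fn new(field_names: Vec<String>) -> Self {
        Self {
            field_names,
            name_index: OnceLock::new(),
        }
    }

    /// Look up a column index by name, building the map the first time only.
    fn index_of(&self, name: &str) -> Option<usize> {
        let index = self.name_index.get_or_init(|| {
            let mut map = HashMap::with_capacity(self.field_names.len());
            for (i, n) in self.field_names.iter().enumerate() {
                // Keep the first occurrence on duplicates, like a linear scan would.
                map.entry(n.clone()).or_insert(i);
            }
            map
        });
        index.get(name).copied()
    }

    /// Reports whether the map has been built yet (handy for tests / benchmarks).
    fn index_is_built(&self) -> bool {
        self.name_index.get().is_some()
    }
}
```

   With this shape, a schema that is created and never queried by name pays 
nothing for the map, and the string copies happen at most once per schema, on 
the first lookup. Whether that trade-off actually wins is what the benchmarks 
in point 1 would need to show.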


