Re: [I] Make a faster way to check column existence in optimizer (not `is_err()`) [arrow-datafusion]

via GitHub Fri, 12 Jan 2024 12:05:43 -0800


alamb commented on issue #5309:
URL: 
https://github.com/apache/arrow-datafusion/issues/5309#issuecomment-1889885280


   > * I profiled the benchmark for a simple query on a wide table (700 
columns) and a significant amount of the CPU time (~87%) is now coming from 
`has_column_with_qualified_name` (first screenshot below): 87% when 
creating the physical plan and 66% when creating the unoptimized logical plan 
(second screenshot).
   > 
   > Given this seems to be a hotspot for wide tables, do you think the best 
next step would be to improve lookup time by adding a btree (or whatever), or 
should we improve the foundation and work on updating the schema first? From 
what I've seen, updating the schema may make adding the index easier, so that 
may be a good start.
   
   Yes, I agree: getting `DFSchema` into better shape (e.g. not actually 
copying so many things) would likely make this task easier.
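
For reference, the index idea from the quoted comment could look roughly like the sketch below. It is a minimal illustration, not DataFusion's code: `IndexedSchema` and `has_qualified_column` are hypothetical names, and the schema is reduced to (qualifier, name) pairs. It uses a `HashMap` where the quoted comment suggests a btree; either one replaces the linear scan.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a schema: fields are (qualifier, name) pairs.
struct IndexedSchema {
    fields: Vec<(Option<String>, String)>,
    /// Maps (qualifier, name) to the field's position in `fields`.
    index: HashMap<(Option<String>, String), usize>,
}

impl IndexedSchema {
    /// Builds the index once, when the schema is created.
    fn new(fields: Vec<(Option<String>, String)>) -> Self {
        let index = fields
            .iter()
            .cloned()
            .enumerate()
            .map(|(i, key)| (key, i))
            .collect();
        Self { fields, index }
    }

    /// Hash lookup instead of scanning every field.
    /// Note: building the key allocates two Strings per call; a real
    /// implementation would likely want a borrowed key to avoid that.
    fn has_qualified_column(&self, qualifier: &str, name: &str) -> bool {
        self.index
            .contains_key(&(Some(qualifier.to_string()), name.to_string()))
    }
}
```

The trade-off is that the index has to be rebuilt or updated whenever fields are added, which is one reason cleaning up the schema first may make this easier.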
   
   It also looks like `has_column_with_qualified_name` is always being called 
from `DFSchema::merge`. I wonder if we can figure out why that needs to be 
called so much. My bet is that most of the callsites don't actually add any new 
fields. Maybe we can quickly check whether the pass made any changes to the 
children; if it didn't, there is no need to call `DFSchema::merge`.
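
A rough sketch of that fast path, under the assumption that an optimizer pass can report whether it changed anything. `Schema`, `PassResult`, and `update_schema` are hypothetical stand-ins for illustration, not DataFusion types:

```rust
/// Hypothetical stand-in for a schema: a list of qualified field names.
#[derive(Debug, Clone, PartialEq)]
struct Schema {
    fields: Vec<String>,
}

impl Schema {
    /// Naive merge: append fields from `other` that are not already present.
    /// The linear `contains` scan per field is what gets expensive on wide tables.
    fn merge(&mut self, other: &Schema) {
        for f in &other.fields {
            if !self.fields.contains(f) {
                self.fields.push(f.clone());
            }
        }
    }
}

/// Hypothetical result of running a pass over a node's children.
struct PassResult {
    child_schema: Schema,
    /// Assumed flag: true only if the pass rewrote something.
    changed: bool,
}

/// Merges only when the pass reports a change; returns whether a merge ran.
fn update_schema(current: &mut Schema, result: &PassResult) -> bool {
    if !result.changed {
        return false; // fast path: nothing changed, so skip the merge entirely
    }
    current.merge(&result.child_schema);
    true
}
```

If most callsites really don't add fields, the `changed` check would skip nearly all of the column-existence lookups.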
   
   Or maybe we can find some way to quickly compare if two schemas are the same 
🤔 
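
One possible way to do that comparison cheaply, assuming schemas are shared behind an `Arc`: check pointer identity first and fall back to a field-by-field comparison only when the pointers differ. `Schema` and `same_schema` are hypothetical names for this sketch:

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a schema: a list of qualified field names.
#[derive(Debug, PartialEq)]
struct Schema {
    fields: Vec<String>,
}

/// Returns true if both handles describe the same schema.
fn same_schema(a: &Arc<Schema>, b: &Arc<Schema>) -> bool {
    // O(1): both Arcs point at the same allocation, so the schemas are identical.
    if Arc::ptr_eq(a, b) {
        return true;
    }
    // Cheap reject on length before the full O(n) comparison.
    a.fields.len() == b.fields.len() && a == b
}
```

The pointer check only helps if passes that make no changes hand back the same `Arc` instead of a fresh copy.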


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
