alamb commented on issue #5309: URL: https://github.com/apache/arrow-datafusion/issues/5309#issuecomment-1889885280
> * I profiled the benchmark for a simple query on a wide table (700 columns), and a significant amount of the CPU time (~87%) is now coming from `has_column_with_qualified_name` (first screenshot below): 87% when creating the physical plan and 66% when creating the unoptimized logical plan (second screenshot).
>
> Given this seems to be a hotspot for wide tables, do you think the best next step would be looking into improving lookup time by adding a btree (or whatever), or should we improve the foundation and work on updating the schema first? From what I've seen, updating the schema may make adding the index easier, so that may be a good start.

Yes, I agree: getting `DFSchema` into better shape (e.g. not actually copying so many things) would likely make this task easier.

It also looks like `has_column_with_qualified_name` is always being called from `DFSchema::merge`. I wonder if we can figure out why that needs to be called so much. My bet is that most of the call sites don't actually add any new fields.

Maybe we can quickly check whether the pass made any changes to the children; if it didn't, there is no need to call `DFSchema::merge`.

Or maybe we can find some way to quickly compare if two schemas are the same 🤔
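
To make the two ideas concrete (an index for name lookups, and a cheap "are these schemas the same?" check before merging), here is a rough sketch. It is not DataFusion code: `ToySchema`, `QualifiedName`, and `col` are hypothetical stand-ins invented for illustration, and the real `DFSchema` has more to it (data types, metadata, functional dependencies).

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Hypothetical stand-in for a qualified column name; not DataFusion's type.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct QualifiedName {
    qualifier: Option<String>,
    name: String,
}

/// Hypothetical helper to build a `QualifiedName`.
fn col(qualifier: Option<&str>, name: &str) -> QualifiedName {
    QualifiedName {
        qualifier: qualifier.map(str::to_string),
        name: name.to_string(),
    }
}

/// Hypothetical simplified schema: a shared field list plus a hash index.
#[derive(Debug, Clone)]
struct ToySchema {
    fields: Arc<Vec<QualifiedName>>,
    index: HashSet<QualifiedName>,
}

impl ToySchema {
    fn new(fields: Vec<QualifiedName>) -> Self {
        let index = fields.iter().cloned().collect();
        Self { fields: Arc::new(fields), index }
    }

    /// Expected O(1) lookup instead of a linear scan over every field.
    fn has_column(&self, c: &QualifiedName) -> bool {
        self.index.contains(c)
    }

    /// Merge `other` into `self`, skipping fields that are already present.
    /// Returns true only if at least one field was added.
    fn merge(&mut self, other: &ToySchema) -> bool {
        // Fast path: same allocation (pointer compare), or equal field lists
        // (one linear pass, still far cheaper than a scan per field).
        if Arc::ptr_eq(&self.fields, &other.fields) || self.fields == other.fields {
            return false;
        }
        let mut changed = false;
        for f in other.fields.iter() {
            if !self.has_column(f) {
                // Copies the field list only if it is shared with another schema.
                Arc::make_mut(&mut self.fields).push(f.clone());
                self.index.insert(f.clone());
                changed = true;
            }
        }
        changed
    }
}
```

In this sketch, merging a schema with a clone of itself returns immediately via `Arc::ptr_eq`, which is the "no new fields" case I suspect is most common. Whether the real code can share the field list like this depends on how `DFSchema` ends up being restructured.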
