Re: [I] Make a faster way to check column existence in optimizer (not `is_err()`) [arrow-datafusion]

via GitHub Fri, 12 Jan 2024 08:01:05 -0800


matthewmturner commented on issue #5309:
URL: 
https://github.com/apache/arrow-datafusion/issues/5309#issuecomment-1889561301


   @alamb 
   I've been looking into this more, searching for places where we can replace 
unused results with booleans, but nothing stood out (let me know if you know of 
any or your intuition says otherwise).  I've also been using the great analysis 
from @zeodotr in 
https://github.com/apache/arrow-datafusion/issues/7698#issuecomment-1815885644 
to guide some of my review.
   
   A couple things:
   
   - I looked at optimization 6 from @zeodotr's list and I wasn't able to find 
`columnize_expr` as a hot spot in the context of creating the physical plan (I 
tried reproducing on a wide table with several aggregates), which I believe is 
the use case they had (I didn't create 3000+ aggregates like they have, 
though).  It shows up as ~3% of CPU when creating the unoptimized logical plan.
   - I profiled the benchmark for a simple query on a wide table (700 columns) 
and most of the CPU time now comes from `has_column_with_qualified_name` (first 
screenshot below): ~87% when creating the physical plan and 66% when creating 
the unoptimized logical plan (second screenshot).
   
   Given this seems to be a hotspot for wide tables, do you think the best next 
step would be looking into improving lookup time by adding a btree, or should 
we improve the foundation and work on updating the schema first?
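   
   To make the btree option concrete, here is a minimal sketch of the two 
lookup strategies. It does not use DataFusion's actual schema types: `Field`, 
`Schema`, `has_column_scan`, `has_column_indexed` and `wide_schema` are all 
hypothetical names made up for illustration, and the nested `BTreeMap` is just 
one possible index layout, not a proposal for the real API.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified stand-in for a schema field.
#[derive(Debug, Clone)]
pub struct Field {
    pub qualifier: Option<String>,
    pub name: String,
}

/// Hypothetical schema: ordered fields plus an index built once up front.
pub struct Schema {
    fields: Vec<Field>,
    /// qualifier -> (column name -> position of the first matching field).
    /// Only qualified fields are indexed in this sketch.
    index: BTreeMap<String, BTreeMap<String, usize>>,
}

impl Schema {
    pub fn new(fields: Vec<Field>) -> Self {
        let mut index: BTreeMap<String, BTreeMap<String, usize>> = BTreeMap::new();
        for (i, f) in fields.iter().enumerate() {
            if let Some(q) = &f.qualifier {
                index
                    .entry(q.clone())
                    .or_default()
                    .entry(f.name.clone())
                    .or_insert(i);
            }
        }
        Self { fields, index }
    }

    /// Linear scan over every field: O(n) string comparisons per call.
    pub fn has_column_scan(&self, qualifier: &str, name: &str) -> bool {
        self.fields
            .iter()
            .any(|f| f.qualifier.as_deref() == Some(qualifier) && f.name == name)
    }

    /// Indexed lookup: two O(log n) map probes, no allocation per call.
    pub fn has_column_indexed(&self, qualifier: &str, name: &str) -> bool {
        self.index
            .get(qualifier)
            .map_or(false, |names| names.contains_key(name))
    }
}

/// Builds a wide single-table schema with columns `t.c0 .. t.c{n-1}`.
pub fn wide_schema(n: usize) -> Schema {
    let fields = (0..n)
        .map(|i| Field {
            qualifier: Some("t".to_string()),
            name: format!("c{i}"),
        })
        .collect();
    Schema::new(fields)
}
```

   With `wide_schema(700)`, both methods should agree on every lookup; the 
difference is that the scan touches up to 700 fields per call, while the 
indexed version pays the cost once when the schema is built. That build cost 
matters if schemas are created frequently, which is part of why the ordering 
of the two options is a real question.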
   
   <img width="1728" alt="image" 
src="https://github.com/apache/arrow-datafusion/assets/22136083/d5892e02-68f4-4276-88c8-31587acdf4ee">
   
   <img width="1728" alt="image" 
src="https://github.com/apache/arrow-datafusion/assets/22136083/cb3ea6e6-061c-4cf0-a753-661a88b37988">


