alamb commented on PR #12816:
URL: https://github.com/apache/datafusion/pull/12816#issuecomment-2411137742

   > > Here is the performance of this PR. Some queries are slower, some are 
faster.
   > > I believe once we turn on string view everything will be faster.
   > 
   > Thanks @alamb It's interesting 🤔 Does this benchmark only include the 
change made by this PR, or does it include others? It seems there are many 
queries slowed down by this PR.
   
   It only includes changes made by this PR 
   
   The results with several other changes are here: 
https://github.com/apache/datafusion/pull/12092#issuecomment-2410952786 (and 
they are all faster 🎉 )
   
   > 
   > Before this PR, the casting flow is
   > 
   > ```
   > Binary(parquet) -> Binary(arrow) -> BinaryView(arrow) -> StringView(arrow)
   > ```
   > 
   > Now, it's
   > 
   > ```
   > Binary(paruqet) -> StringView(arrow)
   > ```
   > 
   > Theoretically, we save the two steps (including the most expensive ones) 
for it. I have no idea why they would be slower. I might try to do some 
profiling for the slower cases 🤔
   
   I think the reason it is slower is that there are some operations in the 
hash grouping code that have specializations for `StringArray`/`BinaryArray` 
but do not (yet) have specializations for StringView.  Specifically
   
   - https://github.com/apache/datafusion/pull/12792
   - https://github.com/apache/datafusion/pull/12809 from @Rachelint
   
   So while this PR makes the scan faster, the total time is slower as those 
paths dominated the query path. 
   
   When they are all put together we get the speedup we have been looking for


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to