alamb commented on PR #12816: URL: https://github.com/apache/datafusion/pull/12816#issuecomment-2411137742
> > Here is the performance of this PR. Some queries are slower, some are faster. > > I believe once we turn on string view everything will be faster. > > Thanks @alamb It's interesting 🤔 Does this benchmark only include the change made by this PR, or does it include others? It seems there are many queries slowed down by this PR. It only includes changes made by this PR The results with several other changes are here: https://github.com/apache/datafusion/pull/12092#issuecomment-2410952786 (and they are all faster 🎉 ) > > Before this PR, the casting flow is > > ``` > Binary(parquet) -> Binary(arrow) -> BinaryView(arrow) -> StringView(arrow) > ``` > > Now, it's > > ``` > Binary(paruqet) -> StringView(arrow) > ``` > > Theoretically, we save the two steps (including the most expensive ones) for it. I have no idea why they would be slower. I might try to do some profiling for the slower cases 🤔 I think the reason it is slower is that there are some operations in the hash grouping code that have specializations for `StringArray`/`BinaryArray` but do not (yet) have specializations for StringView. Specifically - https://github.com/apache/datafusion/pull/12792 - https://github.com/apache/datafusion/pull/12809 from @Rachelint So while this PR makes the scan faster, the total time is slower as those paths dominated the query path. When they are all put together we get the speedup we have been looking for -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org