jayzhan211 commented on PR #8849: URL: https://github.com/apache/arrow-datafusion/pull/8849#issuecomment-1891165240
> > > I thought the more short strings there are, the larger the performance gain, but it shows that the more long strings there are, the better
> >
> > It seems to me this means that maybe the small string optimization is unnecessary at this time given it doesn't seem to make a significant difference to performance 🤔
> > Maybe we could simplify the code?
>
> If the number of rows is large (> 1e6), then the speed gain for short strings is on the order of seconds (5s faster for n=1e6)

The hits.parquet data is not large enough.

```
SELECT COUNT(DISTINCT "HitColor"), COUNT(DISTINCT "BrowserCountry"), COUNT(DISTINCT "BrowserLanguage")
FROM hits
```

where the values are all of length 1 or 2. This does not show a difference.

```
SELECT COUNT(DISTINCT "HitColor"), COUNT(DISTINCT "BrowserCountry"), COUNT(DISTINCT "BrowserLanguage"), COUNT(DISTINCT "URL")
FROM hits
```

URL length is mostly > 8. This improves from about 11s to about 9s:

```
Query 0 iteration 0 took 11751.8 ms and returned 1 rows
Query 0 iteration 1 took 11154.3 ms and returned 1 rows
Query 0 iteration 2 took 10434.3 ms and returned 1 rows
Query 0 iteration 3 took 10988.1 ms and returned 1 rows
Query 0 iteration 4 took 12159.3 ms and returned 1 rows
```

```
Query 0 iteration 0 took 9415.5 ms and returned 1 rows
Query 0 iteration 1 took 9009.4 ms and returned 1 rows
Query 0 iteration 2 took 9832.9 ms and returned 1 rows
Query 0 iteration 3 took 10004.6 ms and returned 1 rows
Query 0 iteration 4 took 9829.1 ms and returned 1 rows
```
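
To put a number on "11s to 9s", the two sets of iteration timings quoted above can be averaged. This is only a rough sketch: the names `first_run_ms` and `second_run_ms` are mine, and I am assuming the first block of timings is the slower baseline and the second block is the faster run, as the "from 11s to 9s" wording suggests.

```python
from statistics import mean

# Timings in milliseconds, copied from the two benchmark outputs in the comment.
# Assumption: the first block is the baseline, the second is the improved run.
first_run_ms = [11751.8, 11154.3, 10434.3, 10988.1, 12159.3]
second_run_ms = [9415.5, 9009.4, 9832.9, 10004.6, 9829.1]

first_mean = mean(first_run_ms)
second_mean = mean(second_run_ms)

# Absolute and relative difference between the two means.
saved_ms = first_mean - second_mean
saved_pct = 100 * saved_ms / first_mean

print(f"first run mean:  {first_mean:.1f} ms")
print(f"second run mean: {second_mean:.1f} ms")
print(f"difference:      {saved_ms:.1f} ms ({saved_pct:.1f}%)")
```

With five iterations per run this works out to roughly 11.3s versus 9.6s, a difference of about 15%. The sample is small and the per-iteration spread is close to 1s, so the figure is indicative rather than precise.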