Re: [PR] Short string optimized`DistinctCountAccumulator` for string [arrow-datafusion]

via GitHub Sun, 14 Jan 2024 17:30:36 -0800


jayzhan211 commented on PR #8849:
URL: 
https://github.com/apache/arrow-datafusion/pull/8849#issuecomment-1891165240


   > > > I expected shorter strings to give the larger performance gain, but the results show that longer strings benefit more
   > > 
   > > 
   > > It seems to me this means that maybe the small string optimization is unnecessary at this time, given it doesn't seem to make a significant difference to performance 🤔
   > > Maybe we could simplify the code?
   > 
   > If the number of rows is large (> 1e6), then the speed gain for short strings is on the order of seconds (5s faster for n=1e6)
   
   The hits.parquet data is not large enough to show this. For example:
   ```
   SELECT COUNT(DISTINCT "HitColor"), COUNT(DISTINCT "BrowserCountry"),
          COUNT(DISTINCT "BrowserLanguage") FROM hits
   ```
   Here the values in every column have length 1 or 2, and the query shows no difference in run time.
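
To see which side of a short-string cutoff a column falls on, one could profile its value lengths. This is a minimal sketch, assuming an 8-byte cutoff (inferred from the "> 8" remark about URL in this comment, not checked against the PR's code). The helper `share_longer_than` and the sample values are hypothetical and are not taken from hits.parquet.

```python
def share_longer_than(values, cutoff=8):
    """Return the fraction of strings whose UTF-8 byte length exceeds `cutoff`."""
    values = list(values)
    if not values:
        return 0.0
    longer = sum(1 for v in values if len(v.encode("utf-8")) > cutoff)
    return longer / len(values)


# Made-up sample values for illustration only.
short_sample = ["5", "ru", "en", "K"]
long_sample = ["http://example.com/a/b", "http://example.com/c", "x"]

print(share_longer_than(short_sample))
print(share_longer_than(long_sample))
```

A column whose share is close to 0 would mostly exercise the short-string path, while a column like URL would mostly bypass it.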
   
   ```
   SELECT COUNT(DISTINCT "HitColor"), COUNT(DISTINCT "BrowserCountry"),
          COUNT(DISTINCT "BrowserLanguage"), COUNT(DISTINCT "URL") FROM hits
   ```
   URL values mostly have length > 8. With this column included, the query time improves from about 11s to about 9s:
   
   
   ```
   Query 0 iteration 0 took 11751.8 ms and returned 1 rows
   Query 0 iteration 1 took 11154.3 ms and returned 1 rows
   Query 0 iteration 2 took 10434.3 ms and returned 1 rows
   Query 0 iteration 3 took 10988.1 ms and returned 1 rows
   Query 0 iteration 4 took 12159.3 ms and returned 1 rows
   ```
   
   ```
   Query 0 iteration 0 took 9415.5 ms and returned 1 rows
   Query 0 iteration 1 took 9009.4 ms and returned 1 rows
   Query 0 iteration 2 took 9832.9 ms and returned 1 rows
   Query 0 iteration 3 took 10004.6 ms and returned 1 rows
   Query 0 iteration 4 took 9829.1 ms and returned 1 rows
   ```
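
For a rough summary of the two timing listings, the per-iteration numbers can be averaged. This is a small sketch that assumes the first listing is the baseline and the second is the run with the change (the comment's "11s to 9s" suggests this order, but the listings are not labeled). The variable names are illustrative only.

```python
from statistics import mean

# Per-iteration timings in ms, copied from the two listings in this comment.
first_run = [11751.8, 11154.3, 10434.3, 10988.1, 12159.3]
second_run = [9415.5, 9009.4, 9832.9, 10004.6, 9829.1]

first_mean = mean(first_run)
second_mean = mean(second_run)

# Relative reduction in mean query time between the two listings.
reduction = 1 - second_mean / first_mean

print(f"first listing:  {first_mean:.1f} ms")
print(f"second listing: {second_mean:.1f} ms")
print(f"reduction:      {reduction:.1%}")
```

Five iterations per listing is a small sample, and the iterations vary by more than a second, so the averages are only indicative.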


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

