[ https://issues.apache.org/jira/browse/ARROW-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16269706#comment-16269706 ]
ASF GitHub Bot commented on ARROW-1844:
---------------------------------------
wesm opened a new pull request #1370: ARROW-1844: [C++] Add initial Unique
benchmarks for int64, variable-length strings
URL: https://github.com/apache/arrow/pull/1370
I also fixed a bug in the hash table resize that these benchmarks surfaced
(unit test coverage was not adequate).
Now we have:
```
$ ./release/compute-benchmark
Run on (8 X 4174.84 MHz CPU s)
2017-11-28 18:18:26
Benchmark                                                         Time         CPU  Iterations
-------------------------------------------------------------------------------------------------
BM_BuildDictionary/min_time:1.000                              1451 us     1451 us         959  2.68974GB/s
BM_BuildStringDictionary/min_time:1.000                        4005 us     4005 us         350  75.3785MB/s
BM_UniqueInt64NoNulls/16M/50/min_time:1.000/real_time         35940 us    35942 us          39  91.3192MB/s
BM_UniqueInt64NoNulls/16M/1024/min_time:1.000/real_time      120002 us   120006 us          12  88.8877MB/s
BM_UniqueInt64NoNulls/16M/10k/min_time:1.000/real_time       175855 us   175862 us           8  90.9838MB/s
BM_UniqueInt64NoNulls/16M/1024k/min_time:1.000/real_time     452242 us   452257 us           3  94.3449MB/s
BM_UniqueInt64WithNulls/16M/50/min_time:1.000/real_time       58632 us    58634 us          29  75.2797MB/s
BM_UniqueInt64WithNulls/16M/1024/min_time:1.000/real_time    134079 us   134084 us          10  95.4661MB/s
BM_UniqueInt64WithNulls/16M/10k/min_time:1.000/real_time     183846 us   183851 us           8  87.0295MB/s
BM_UniqueInt64WithNulls/16M/1024k/min_time:1.000/real_time   528790 us   528808 us           3  80.6873MB/s
BM_UniqueString10bytes/16M/50/min_time:1.000/real_time       152207 us   152212 us           9  116.8MB/s
BM_UniqueString10bytes/16M/1024/min_time:1.000/real_time     260047 us   260056 us           5  123.055MB/s
BM_UniqueString10bytes/16M/10k/min_time:1.000/real_time      426539 us   426552 us           3  125.038MB/s
BM_UniqueString10bytes/16M/1024k/min_time:1.000/real_time   1716739 us  1716791 us           1  93.2MB/s
BM_UniqueString100bytes/16M/50/min_time:1.000/real_time      556145 us   556165 us           3  958.982MB/s
BM_UniqueString100bytes/16M/1024/min_time:1.000/real_time    693922 us   693943 us           2  1.12585GB/s
BM_UniqueString100bytes/16M/10k/min_time:1.000/real_time    1000449 us  1000484 us           1  1.5618GB/s
BM_UniqueString100bytes/16M/1024k/min_time:1.000/real_time  3591215 us  3591314 us           1  445.532MB/s
```
This suggests quite a lot of room for improvement. It's counter-intuitive
to me that hashing strings appears faster (in reported bytes/s) than hashing
integers, so we should figure out what's going on there.
With these benchmarks in place, we can also refactor the hash table
implementations without worrying too much about whether we're making things
slower.
> [C++] Basic benchmark suite for hash kernels
> --------------------------------------------
>
> Key: ARROW-1844
> URL: https://issues.apache.org/jira/browse/ARROW-1844
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Wes McKinney
> Assignee: Wes McKinney
> Labels: pull-request-available
> Fix For: 0.8.0
>
>
> * Integers, small cardinality and large cardinality
> * Short strings, small/large cardinality
> * Long strings, small/large cardinality
> These benchmarks will enable us to refactor without fear, and to experiment
> with faster hash functions.