[ 
https://issues.apache.org/jira/browse/ARROW-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369666#comment-16369666
 ] 

ASF GitHub Bot commented on ARROW-1942:
---------------------------------------

wesm commented on issue #1551: ARROW-1942: [C++] Hash table specializations for small integers
URL: https://github.com/apache/arrow/pull/1551#issuecomment-366852383
 
 
   Top-level numbers look OK to me:
   
   ```
   In [1]: import numpy as np
   
   In [2]: arr = np.random.randint(0, 200, size=10000000)
   
   In [3]: import pyarrow as pa
   
   In [4]: pa
   Out[4]: <module 'pyarrow' from '/home/wesm/code/arrow/python/pyarrow/__init__.py'>
   
   In [5]: parr = pa.array(arr)
   
   In [9]: timeit result = parr.unique()
   33.5 ms ± 75 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
   
   In [10]: import pandas as pd
   
   In [11]: timeit result2 = pd.unique(arr)
   25.7 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
   
   In [12]: timeit result2 = np.unique(arr)
   296 ms ± 597 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
   
   In [13]: parr_int8 = pa.array(arr.astype('int8'))
   
   In [14]: arr_int8 = arr.astype('int8')
   
   In [15]: timeit result = parr_int8.unique()
   10.1 ms ± 99.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
   
   In [16]: timeit result = pd.unique(arr_int8)
   35.3 ms ± 156 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
   
   In [17]: timeit result = np.unique(arr_int8)
   282 ms ± 248 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
   ```
   
   So we're about 30% slower than pandas for int64 at the moment (33.5 ms vs. 25.7 ms, in this limited benchmark at least), which suggests plenty of room for improvement.
   
   Everything else looks good. +1, will merge on green build. Thanks @xuepanchen!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Hash table specializations for small integers
> ---------------------------------------------------
>
>                 Key: ARROW-1942
>                 URL: https://issues.apache.org/jira/browse/ARROW-1942
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Wes McKinney
>            Assignee: Panchen Xue
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.9.0
>
>
> There is no need to use a dynamically-sized hash table with uint8 or int8
> keys, since a fixed-size lookup table can be used instead, avoiding hashing altogether.
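
The idea in the issue description can be sketched roughly as follows. This is only an illustration in Python, not the C++ code from the pull request; the function name `unique_int8_lookup` and the first-seen output order are assumptions made for the example. Because an 8-bit key has only 256 possible values, each value can index directly into a fixed-size table, so no hash function or table resizing is needed.

```python
def unique_int8_lookup(values):
    """Return the distinct int8 values in first-seen order.

    Hypothetical sketch: a fixed 256-slot table replaces the hash table.
    """
    seen = [False] * 256            # one slot per possible 8-bit value
    out = []
    for v in values:
        v = int(v)
        if not -128 <= v <= 127:
            raise ValueError("value out of int8 range: %d" % v)
        slot = v + 128              # map [-128, 127] onto [0, 255]
        if not seen[slot]:
            seen[slot] = True
            out.append(v)
    return out


print(unique_int8_lookup([3, -1, 3, 127, -128, -1]))  # prints [3, -1, 127, -128]
```

For uint8 the same approach applies with the value itself as the slot index, without the offset of 128.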



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
