[jira] [Comment Edited] (ARROW-10058) [C++] Investigate performance of LevelsToBitmap without BMI2

Antoine Pitrou (Jira) Mon, 28 Sep 2020 05:05:05 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-10058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17203151#comment-17203151
 ]


Antoine Pitrou edited comment on ARROW-10058 at 9/28/20, 12:04 PM:
-------------------------------------------------------------------

Here is an updated patch using a 5-bit lookup and direct popcount computation:
https://gist.github.com/pitrou/ee2cad31a7c1789fdd3d4d6db43d4a3f

This seems to give the best results on AMD Ryzen.
* before (git master):
{code}
BM_ReadListColumn/0     7206061 ns      7203281 ns           98 bytes_per_second=1110.61M/s items_per_second=145.569M/s
BM_ReadListColumn/1     8530578 ns      8527627 ns           80 bytes_per_second=938.127M/s items_per_second=122.962M/s
BM_ReadListColumn/50   15790410 ns     15786876 ns           44 bytes_per_second=506.75M/s items_per_second=66.4207M/s
BM_ReadListColumn/99    6756996 ns      6754071 ns          102 bytes_per_second=1.15671G/s items_per_second=155.251M/s

BM_DefinitionLevelsToBitmapRepeatedAllMissing         703 ns          703 ns       986029 bytes_per_second=2.71222G/s
BM_DefinitionLevelsToBitmapRepeatedAllPresent        2074 ns         2074 ns       336007 bytes_per_second=941.881M/s
BM_DefinitionLevelsToBitmapRepeatedMostPresent       2057 ns         2057 ns       340874 bytes_per_second=949.539M/s
{code}
* after:
{code}
BM_ReadListColumn/0     6807335 ns      6802293 ns          104 bytes_per_second=1.14851G/s items_per_second=154.15M/s
BM_ReadListColumn/1     8011510 ns      8008278 ns           86 bytes_per_second=998.966M/s items_per_second=130.937M/s
BM_ReadListColumn/50   12008336 ns     12005061 ns           58 bytes_per_second=666.386M/s items_per_second=87.3445M/s
BM_ReadListColumn/99    5854171 ns      5851619 ns          115 bytes_per_second=1.3351G/s items_per_second=179.194M/s

BM_DefinitionLevelsToBitmapRepeatedAllMissing         827 ns          826 ns       832857 bytes_per_second=2.30799G/s
BM_DefinitionLevelsToBitmapRepeatedAllPresent         932 ns          932 ns       752094 bytes_per_second=2.04596G/s
BM_DefinitionLevelsToBitmapRepeatedMostPresent       1531 ns         1531 ns       459072 bytes_per_second=1.24599G/s
{code}
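
To make the before/after comparison easier to read, the ratios can be computed from the CPU times in the two listings. The snippet below is only a convenience sketch: the `Row` struct and the `Speedup` and `PrintSpeedups` helpers are names made up for this note, and the numbers are copied from the output above. A ratio above 1.0 means the patched build is faster; by this measure the AllPresent case improves by roughly 2.2x, while AllMissing comes out somewhat slower than before.

```cpp
#include <cstdio>

// Ratio of the old time to the new time: above 1.0 means the patch is faster.
double Speedup(double before_ns, double after_ns) {
  return before_ns / after_ns;
}

// Hypothetical helper type for this note: one benchmark with its two timings.
struct Row {
  const char* name;
  double before_ns;
  double after_ns;
};

// CPU times (second column) copied from the benchmark output above.
const Row kRows[] = {
    {"BM_ReadListColumn/0", 7203281, 6802293},
    {"BM_ReadListColumn/1", 8527627, 8008278},
    {"BM_ReadListColumn/50", 15786876, 12005061},
    {"BM_ReadListColumn/99", 6754071, 5851619},
    {"BM_DefinitionLevelsToBitmapRepeatedAllMissing", 703, 826},
    {"BM_DefinitionLevelsToBitmapRepeatedAllPresent", 2074, 932},
    {"BM_DefinitionLevelsToBitmapRepeatedMostPresent", 2057, 1531},
};

// Prints one line per benchmark with its before/after ratio.
void PrintSpeedups() {
  for (const Row& row : kRows) {
    std::printf("%-48s %.2fx\n", row.name,
                Speedup(row.before_ns, row.after_ns));
  }
}
```

Calling `PrintSpeedups()` from a `main` function lists the ratio for each benchmark. These are single runs on one machine, so small differences should not be over-interpreted.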



> [C++] Investigate performance of LevelsToBitmap without BMI2
> ------------------------------------------------------------
>
>                 Key: ARROW-10058
>                 URL: https://issues.apache.org/jira/browse/ARROW-10058
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: C++
>            Reporter: Antoine Pitrou
>            Priority: Major
>         Attachments: opt-level-conv.diff
>
>
> Currently, when Parquet nested data involves repetition levels, converting
> the levels to a bitmap goes through a slow scalar path unless the BMI2
> instruction set is available and efficient (the BMI2 path uses the PEXT
> instruction to process 16 levels at once).
> It may be possible to emulate PEXT for 5- or 6-bit masks by using a lookup
> table, which would allow processing 5-6 levels at once.
> (Also, it would be good to add nested reading benchmarks for non-trivial
> nesting; currently we only benchmark one-level struct and one-level list.)
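
For readers unfamiliar with the idea in the description, here is a minimal sketch of emulating PEXT with a lookup table over 5-bit chunks. It is not the code from the attached patch or the gist: `PextScalar`, `PextTable`, `PextLookup5` and `PextEmulated` are hypothetical names chosen for illustration, and the table layout (32 x 32 bytes, indexed by mask then value) is just one possible choice. The popcount of each mask chunk tells how many output bits the chunk produced, so the next chunk's result can be appended at the right position.

```cpp
#include <bitset>
#include <cstdint>

// Reference bit-by-bit PEXT: gathers the bits of `value` selected by `mask`
// and packs them into the low bits of the result.
inline uint32_t PextScalar(uint32_t value, uint32_t mask) {
  uint32_t out = 0;
  int out_bit = 0;
  for (int bit = 0; bit < 32; ++bit) {
    if (mask & (1u << bit)) {
      if (value & (1u << bit)) out |= (1u << out_bit);
      ++out_bit;
    }
  }
  return out;
}

constexpr int kLookupBits = 5;  // assumption: 5-bit chunks, as in the comment
constexpr uint32_t kChunkMask = (1u << kLookupBits) - 1;

// Precomputed PEXT results for every (mask, value) pair of 5-bit inputs.
struct PextTable {
  uint8_t entries[1 << kLookupBits][1 << kLookupBits];
  PextTable() {
    for (uint32_t mask = 0; mask <= kChunkMask; ++mask) {
      for (uint32_t value = 0; value <= kChunkMask; ++value) {
        entries[mask][value] = static_cast<uint8_t>(PextScalar(value, mask));
      }
    }
  }
};

// PEXT restricted to the low 5 bits of `value` and `mask`, via table lookup.
inline uint32_t PextLookup5(uint32_t value, uint32_t mask) {
  static const PextTable table;  // built once, on first use
  return table.entries[mask & kChunkMask][value & kChunkMask];
}

// Full 32-bit PEXT assembled from 5-bit lookups. The popcount of each mask
// chunk gives the number of output bits that chunk contributed.
inline uint32_t PextEmulated(uint32_t value, uint32_t mask) {
  uint32_t out = 0;
  int out_bits = 0;
  for (int shift = 0; shift < 32; shift += kLookupBits) {
    const uint32_t mask_chunk = (mask >> shift) & kChunkMask;
    const uint32_t value_chunk = (value >> shift) & kChunkMask;
    out |= PextLookup5(value_chunk, mask_chunk) << out_bits;
    out_bits += static_cast<int>(std::bitset<32>(mask_chunk).count());
  }
  return out;
}
```

With a 5-bit chunk the table takes 1 KiB; a 6-bit chunk would need 4 KiB with the same layout, which is one reason the chunk width is worth benchmarking rather than assuming.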



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
