[
https://issues.apache.org/jira/browse/ARROW-10058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17203151#comment-17203151
]
Antoine Pitrou edited comment on ARROW-10058 at 9/28/20, 12:04 PM:
-------------------------------------------------------------------
Here is an updated patch with a 5-bit lookup and direct popcount computation:
https://gist.github.com/pitrou/ee2cad31a7c1789fdd3d4d6db43d4a3f
This seems to give the best results on AMD Ryzen.
* before (git master):
{code}
BM_ReadListColumn/0                              7206061 ns   7203281 ns      98 bytes_per_second=1110.61M/s items_per_second=145.569M/s
BM_ReadListColumn/1                              8530578 ns   8527627 ns      80 bytes_per_second=938.127M/s items_per_second=122.962M/s
BM_ReadListColumn/50                            15790410 ns  15786876 ns      44 bytes_per_second=506.75M/s items_per_second=66.4207M/s
BM_ReadListColumn/99                             6756996 ns   6754071 ns     102 bytes_per_second=1.15671G/s items_per_second=155.251M/s
BM_DefinitionLevelsToBitmapRepeatedAllMissing        703 ns       703 ns  986029 bytes_per_second=2.71222G/s
BM_DefinitionLevelsToBitmapRepeatedAllPresent       2074 ns      2074 ns  336007 bytes_per_second=941.881M/s
BM_DefinitionLevelsToBitmapRepeatedMostPresent      2057 ns      2057 ns  340874 bytes_per_second=949.539M/s
{code}
* after:
{code}
BM_ReadListColumn/0                              6807335 ns   6802293 ns     104 bytes_per_second=1.14851G/s items_per_second=154.15M/s
BM_ReadListColumn/1                              8011510 ns   8008278 ns      86 bytes_per_second=998.966M/s items_per_second=130.937M/s
BM_ReadListColumn/50                            12008336 ns  12005061 ns      58 bytes_per_second=666.386M/s items_per_second=87.3445M/s
BM_ReadListColumn/99                             5854171 ns   5851619 ns     115 bytes_per_second=1.3351G/s items_per_second=179.194M/s
BM_DefinitionLevelsToBitmapRepeatedAllMissing        827 ns       826 ns  832857 bytes_per_second=2.30799G/s
BM_DefinitionLevelsToBitmapRepeatedAllPresent        932 ns       932 ns  752094 bytes_per_second=2.04596G/s
BM_DefinitionLevelsToBitmapRepeatedMostPresent      1531 ns      1531 ns  459072 bytes_per_second=1.24599G/s
{code}
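For readers who do not open the gist, here is a rough sketch of what a 5-bit lookup combined with popcount could look like. It is not the code from the patch: the names ScalarPext, BuildPextTable, PopCount and LookupPext are made up for illustration, and the sketch assumes a 32-bit word consumed in 5-bit chunks. The table maps each (mask, value) pair of 5-bit chunks to the extracted bits, and the popcount of the mask chunk tells how far to advance the output position.

```cpp
#include <array>
#include <cstdint>

// Hypothetical names throughout; none of this is taken from the patch.
constexpr int kLookupBits = 5;
constexpr uint32_t kChunkMask = (1u << kLookupBits) - 1;  // 0b11111

// Reference bit-by-bit PEXT: gathers the bits of `value` selected by `mask`
// into the low-order bits of the result.
uint32_t ScalarPext(uint32_t value, uint32_t mask) {
  uint32_t out = 0;
  int out_pos = 0;
  for (int i = 0; i < 32; ++i) {
    if (mask & (1u << i)) {
      if (value & (1u << i)) {
        out |= 1u << out_pos;
      }
      ++out_pos;
    }
  }
  return out;
}

// Builds a 32 x 32 table where table[mask][value] == ScalarPext(value, mask)
// for all 5-bit masks and values (1024 one-byte entries).
std::array<std::array<uint8_t, 32>, 32> BuildPextTable() {
  std::array<std::array<uint8_t, 32>, 32> table{};
  for (uint32_t mask = 0; mask < 32; ++mask) {
    for (uint32_t value = 0; value < 32; ++value) {
      table[mask][value] = static_cast<uint8_t>(ScalarPext(value, mask));
    }
  }
  return table;
}

// Portable popcount; real code would more likely use a compiler builtin.
int PopCount(uint32_t x) {
  int n = 0;
  while (x != 0) {
    x &= x - 1;  // clear the lowest set bit
    ++n;
  }
  return n;
}

// Emulates PEXT on a 32-bit word, 5 bits at a time, without BMI2.
uint32_t LookupPext(uint32_t value, uint32_t mask) {
  static const auto table = BuildPextTable();
  uint32_t out = 0;
  int out_pos = 0;
  while (mask != 0) {
    const uint32_t m = mask & kChunkMask;
    const uint32_t v = value & kChunkMask;
    out |= static_cast<uint32_t>(table[m][v]) << out_pos;
    out_pos += PopCount(m);  // number of bits this chunk contributed
    mask >>= kLookupBits;
    value >>= kLookupBits;
  }
  return out;
}
```

In this sketch the popcount of the full mask is also the number of output bits produced, which is the quantity a level-to-bitmap conversion would need to advance its output offset.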
> [C++] Investigate performance of LevelsToBitmap without BMI2
> ------------------------------------------------------------
>
> Key: ARROW-10058
> URL: https://issues.apache.org/jira/browse/ARROW-10058
> Project: Apache Arrow
> Issue Type: Sub-task
> Components: C++
> Reporter: Antoine Pitrou
> Priority: Major
> Attachments: opt-level-conv.diff
>
>
> Currently, when Parquet nested data involves repetition levels,
> converting the levels to a bitmap goes through a slow scalar path unless the
> BMI2 instruction set is available and efficient (the BMI2 path uses the PEXT
> instruction to process 16 levels at once).
> It may be possible to emulate PEXT for 5- or 6-bit masks by using a lookup
> table, which would allow processing 5-6 levels at once.
> (Also, it would be good to add nested reading benchmarks for non-trivial
> nesting; currently we only benchmark one-level struct and one-level list.)