mapleFU commented on issue #40845:
URL: https://github.com/apache/arrow/issues/40845#issuecomment-2041524939
I wrote a naive BMI2 implementation. Results with it on an Intel Xeon:
```
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1
4729 ns 4734 ns 149748 bytes_per_second=3.18578Gi/s
items_per_second=1.71035G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7
14434 ns 14453 ns 48206 bytes_per_second=1.04339Gi/s
items_per_second=560.166M/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024
2763 ns 2769 ns 251599 bytes_per_second=5.44634Gi/s
items_per_second=2.92398G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1
4311 ns 4316 ns 162065 bytes_per_second=3.49358Gi/s
items_per_second=1.8756G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1
4122 ns 4125 ns 169853 bytes_per_second=3.65559Gi/s
items_per_second=1.96258G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1
3886 ns 3888 ns 179847 bytes_per_second=3.8783Gi/s
items_per_second=2.08215G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7
13068 ns 13080 ns 53376 bytes_per_second=1.15291Gi/s
items_per_second=618.963M/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1
3742 ns 3748 ns 186340 bytes_per_second=4.02357Gi/s
items_per_second=2.16014G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7
3745 ns 3750 ns 186374 bytes_per_second=4.02114Gi/s
items_per_second=2.15883G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024
3742 ns 3747 ns 186728 bytes_per_second=4.02498Gi/s
items_per_second=2.1609G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1
3424 ns 3429 ns 204069 bytes_per_second=4.39745Gi/s
items_per_second=2.36086G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1
3499 ns 3504 ns 199696 bytes_per_second=4.30356Gi/s
items_per_second=2.31045G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1
3311 ns 3315 ns 208379 bytes_per_second=4.54882Gi/s
items_per_second=2.44213G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7
3499 ns 3506 ns 199599 bytes_per_second=4.3016Gi/s
items_per_second=2.3094G/s
```
Before:
```
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1
5534 ns 5541 ns 124334 bytes_per_second=2.7213Gi/s
items_per_second=1.46099G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7
16683 ns 16703 ns 42139 bytes_per_second=924.523Mi/s
items_per_second=484.716M/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024
2594 ns 2597 ns 269554 bytes_per_second=5.80692Gi/s
items_per_second=3.11757G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1
4980 ns 4985 ns 140579 bytes_per_second=3.02534Gi/s
items_per_second=1.62421G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1
4782 ns 4786 ns 146402 bytes_per_second=3.15082Gi/s
items_per_second=1.69158G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1
4397 ns 4402 ns 159011 bytes_per_second=3.42562Gi/s
items_per_second=1.83912G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7
15721 ns 15736 ns 44453 bytes_per_second=981.292Mi/s
items_per_second=514.48M/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1
3272 ns 3274 ns 213994 bytes_per_second=4.60583Gi/s
items_per_second=2.47273G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7
3272 ns 3273 ns 213769 bytes_per_second=4.60763Gi/s
items_per_second=2.4737G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024
3272 ns 3272 ns 213713 bytes_per_second=4.60907Gi/s
items_per_second=2.47447G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1
3270 ns 3273 ns 213934 bytes_per_second=4.60708Gi/s
items_per_second=2.47341G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1
3314 ns 3321 ns 210344 bytes_per_second=4.5405Gi/s
items_per_second=2.43766G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1
3273 ns 3279 ns 213189 bytes_per_second=4.5986Gi/s
items_per_second=2.46885G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7
3310 ns 3318 ns 211121 bytes_per_second=4.54534Gi/s
items_per_second=2.44026G/s
```
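In the RLE ReadLevels scenarios the BMI2 version is faster (about 12-17% less
time, except LevelRepeatCount:1024, which is about 6.5% slower), but for BitPack
it is actually slower in every case (about 1-14% more time). I guess BMI2 could
benefit performance when the number of inputs is small.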
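
To make the comparison easier to read, here is a small sketch that computes the relative change in wall-clock time for each benchmark. The timings are copied from the two runs above; the `timings` dict and the `percent_change` helper are names made up for this illustration and are not part of Arrow or of the benchmark harness.

```python
# Wall-clock time in ns per benchmark, copied from the runs above.
# Each value is (before, with_bmi2). NumLevels is 8096 in every case.
timings = {
    "Rle/MaxLevel:1/BatchSize:1024/LevelRepeatCount:1": (5534, 4729),
    "Rle/MaxLevel:1/BatchSize:1024/LevelRepeatCount:7": (16683, 14434),
    "Rle/MaxLevel:1/BatchSize:1024/LevelRepeatCount:1024": (2594, 2763),
    "Rle/MaxLevel:1/BatchSize:2048/LevelRepeatCount:1": (4980, 4311),
    "Rle/MaxLevel:3/BatchSize:1024/LevelRepeatCount:1": (4782, 4122),
    "Rle/MaxLevel:3/BatchSize:2048/LevelRepeatCount:1": (4397, 3886),
    "Rle/MaxLevel:3/BatchSize:1024/LevelRepeatCount:7": (15721, 13068),
    "BitPack/MaxLevel:1/BatchSize:1024/LevelRepeatCount:1": (3272, 3742),
    "BitPack/MaxLevel:1/BatchSize:1024/LevelRepeatCount:7": (3272, 3745),
    "BitPack/MaxLevel:1/BatchSize:1024/LevelRepeatCount:1024": (3272, 3742),
    "BitPack/MaxLevel:1/BatchSize:2048/LevelRepeatCount:1": (3270, 3424),
    "BitPack/MaxLevel:3/BatchSize:1024/LevelRepeatCount:1": (3314, 3499),
    "BitPack/MaxLevel:3/BatchSize:2048/LevelRepeatCount:1": (3273, 3311),
    "BitPack/MaxLevel:3/BatchSize:1024/LevelRepeatCount:7": (3310, 3499),
}


def percent_change(before_ns, after_ns):
    """Relative change in time; negative means the BMI2 run was faster."""
    return (after_ns - before_ns) / before_ns * 100.0


for name, (before, after) in timings.items():
    print(f"{name:58s} {percent_change(before, after):+6.1f}%")
```

Negative values mean the BMI2 run took less time than the baseline, positive values mean it took more.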