[ 
https://issues.apache.org/jira/browse/PARQUET-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17089332#comment-17089332
 ] 

Frank Du commented on PARQUET-1841:
-----------------------------------

There is currently no dedicated benchmark for spaced decoding/encoding, so I 
created one.

[default SSE build]

BM_PlainDecodingSpacedFloat/1024     877 ns     877 ns  796088  bytes_per_second=4.35191G/s
BM_PlainDecodingSpacedFloat/4096    3358 ns    3354 ns  208737  bytes_per_second=4.54884G/s
BM_PlainDecodingSpacedFloat/32768  26919 ns   26892 ns   26001  bytes_per_second=4.5393G/s
BM_PlainDecodingSpacedFloat/65536  53955 ns   53898 ns   12994  bytes_per_second=4.52972G/s

[AVX512 mask_expand intrinsic]

BM_PlainDecodingFloatSpaced/1024     210 ns     210 ns  3335080  bytes_per_second=18.2024G/s
BM_PlainDecodingFloatSpaced/4096     622 ns     622 ns  1141673  bytes_per_second=24.5412G/s
BM_PlainDecodingFloatSpaced/32768   5142 ns    5137 ns   136180  bytes_per_second=23.7629G/s
BM_PlainDecodingFloatSpaced/65536  10294 ns   10283 ns    67707  bytes_per_second=23.7423G/s
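
For a quick comparison, the ratio of the first time column between the two 
runs works out to roughly 4.2x for the /1024 case and about 5.2x to 5.4x for 
the larger sizes. The sketch below only redoes that arithmetic; the array and 
function names are made up for illustration, and it assumes the two runs 
measure the same benchmark even though the names printed differ slightly.

```cpp
#include <cassert>
#include <cstddef>

// Times in ns, copied from the first time column of the two runs above
// (rows: /1024, /4096, /32768, /65536). Names here are illustrative only.
constexpr double kSseNs[] = {877, 3358, 26919, 53955};
constexpr double kAvx512Ns[] = {210, 622, 5142, 10294};

static_assert(sizeof(kSseNs) == sizeof(kAvx512Ns), "one AVX512 row per SSE row");

// Speedup of the AVX512 run over the default SSE run for row i.
constexpr double Speedup(std::size_t i) { return kSseNs[i] / kAvx512Ns[i]; }
```

Calling Speedup(0) through Speedup(3) gives the per-size ratios.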

 

As for SIMD opportunities on SSE/AVX2, I could not find a way to get a speedup 
using only the shuffle/permute APIs. I profiled the code, and the top hotspot 
is the BitUtil::GetBit function. Because SSE/AVX2 lack a mask_expand that can 
consume valid_bits directly, we would still need to build the mask for the 
shuffle/permute API by looping over all bits with GetBit, so no speedup is 
available. 
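
To make the per-bit loop concrete, here is a minimal scalar sketch of the 
spacing step, not the actual Parquet code. GetBit below is a simplified 
stand-in for BitUtil::GetBit that assumes an LSB-first bitmap, ExpandSpaced is 
a hypothetical helper name, and filling null slots with 0 is an assumption 
made for the example. An AVX512 expand does the same placement for 16 floats 
at once from a 16-bit mask, which is why valid_bits can be used directly there.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified stand-in for BitUtil::GetBit: test bit i of an LSB-first bitmap.
inline bool GetBit(const uint8_t* bits, int64_t i) {
  return (bits[i / 8] >> (i % 8)) & 1;
}

// Scalar model of a masked expand: the densely packed values are spread over
// num_slots output slots, landing only where the validity bit is set.
// Slots whose bit is clear (nulls) are left as 0 in this sketch.
inline std::vector<float> ExpandSpaced(const std::vector<float>& dense,
                                       const uint8_t* valid_bits,
                                       int64_t num_slots) {
  std::vector<float> out(static_cast<std::size_t>(num_slots), 0.0f);
  std::size_t next = 0;  // index of the next unread dense value
  for (int64_t i = 0; i < num_slots; ++i) {
    if (GetBit(valid_bits, i)) {  // one GetBit call per slot: the hotspot
      assert(next < dense.size());
      out[static_cast<std::size_t>(i)] = dense[next++];
    }
  }
  return out;
}
```

For example, with the bitmap byte 0x2D (slots 0, 2, 3 and 5 valid) and four 
dense values, the values land in those four slots and the other slots stay 0.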

> [C++] Experiment to see if using SIMD shuffle operations for DecodeSpaced 
> improves performance
> ----------------------------------------------------------------------------------------------
>
>                 Key: PARQUET-1841
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1841
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-cpp
>            Reporter: Micah Kornfield
>            Assignee: Micah Kornfield
>            Priority: Major
>         Attachments: image-2020-04-14-15-01-48-222.png
>
>
> Follow-up from PARQUET-1840: in the current benchmarks, removing the memset 
> somehow either has no impact or is slightly worse.  We 
> should investigate using SIMD operations to speed up spacing. 
>  
> As part of this we can see if moving the memset to only cover uninitialized 
> values after moving all required values provides any speedup.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
