[ https://issues.apache.org/jira/browse/HBASE-28256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17795791#comment-17795791 ]
Becker Ewing commented on HBASE-28256:
--------------------------------------

I wrote a JMH benchmarking suite (and attached the code to this JIRA) that measures the performance of reading various vLongs from both on-heap and off-heap buffers, with and without padding (padding meaning the buffer is at least 9 bytes long, so the vectorized read path can be applied). That is to say, padding should make no difference for the unoptimized readVLong path, but with the optimized method only the padded benchmarks should show a substantial performance improvement (and they do).

{noformat}
Benchmark                                                        (vint)  Mode  Cnt   Score   Error  Units
ReadVLongBenchmark.readVLong_OffHeapBB                                9  avgt    5   4.643 ± 2.917  ns/op
ReadVLongBenchmark.readVLong_OffHeapBB                              512  avgt    5   8.063 ± 0.187  ns/op
ReadVLongBenchmark.readVLong_OffHeapBB                       2146483640  avgt    5  11.999 ± 0.314  ns/op
ReadVLongBenchmark.readVLong_OffHeapBB                    1700104028981  avgt    5  14.880 ± 0.698  ns/op
ReadVLongBenchmark.readVLong_OffHeapBB_Padded                         9  avgt    5   4.233 ± 0.136  ns/op
ReadVLongBenchmark.readVLong_OffHeapBB_Padded                       512  avgt    5   7.986 ± 0.048  ns/op
ReadVLongBenchmark.readVLong_OffHeapBB_Padded                2146483640  avgt    5  12.014 ± 0.012  ns/op
ReadVLongBenchmark.readVLong_OffHeapBB_Padded             1700104028981  avgt    5  14.655 ± 2.216  ns/op
ReadVLongBenchmark.readVLong_OnHeapBB                                 9  avgt    5   4.639 ± 0.012  ns/op
ReadVLongBenchmark.readVLong_OnHeapBB                               512  avgt    5   9.771 ± 0.357  ns/op
ReadVLongBenchmark.readVLong_OnHeapBB                        2146483640  avgt    5  13.928 ± 0.557  ns/op
ReadVLongBenchmark.readVLong_OnHeapBB                     1700104028981  avgt    5  17.487 ± 4.527  ns/op
ReadVLongBenchmark.readVLong_OnHeapBB_Padded                          9  avgt    5   5.245 ± 0.019  ns/op
ReadVLongBenchmark.readVLong_OnHeapBB_Padded                        512  avgt    5  10.086 ± 0.317  ns/op
ReadVLongBenchmark.readVLong_OnHeapBB_Padded                 2146483640  avgt    5  13.764 ± 0.100  ns/op
ReadVLongBenchmark.readVLong_OnHeapBB_Padded              1700104028981  avgt    5  17.200 ± 0.913  ns/op
ReadVLongBenchmark.readVLong_Optimized_OffHeapBB                      9  avgt    5   4.258 ± 0.012  ns/op
ReadVLongBenchmark.readVLong_Optimized_OffHeapBB                    512  avgt    5   8.621 ± 0.339  ns/op
ReadVLongBenchmark.readVLong_Optimized_OffHeapBB             2146483640  avgt    5  12.481 ± 2.609  ns/op
ReadVLongBenchmark.readVLong_Optimized_OffHeapBB          1700104028981  avgt    5  14.211 ± 0.041  ns/op
ReadVLongBenchmark.readVLong_Optimized_OffHeapBB_Padded               9  avgt    5   4.222 ± 0.007  ns/op
ReadVLongBenchmark.readVLong_Optimized_OffHeapBB_Padded             512  avgt    5   8.830 ± 0.022  ns/op
ReadVLongBenchmark.readVLong_Optimized_OffHeapBB_Padded      2146483640  avgt    5   8.998 ± 1.280  ns/op
ReadVLongBenchmark.readVLong_Optimized_OffHeapBB_Padded   1700104028981  avgt    5   8.850 ± 0.047  ns/op
ReadVLongBenchmark.readVLong_Optimized_OnHeapBB                       9  avgt    5   4.751 ± 0.732  ns/op
ReadVLongBenchmark.readVLong_Optimized_OnHeapBB                     512  avgt    5  10.575 ± 0.024  ns/op
ReadVLongBenchmark.readVLong_Optimized_OnHeapBB              2146483640  avgt    5  14.231 ± 0.385  ns/op
ReadVLongBenchmark.readVLong_Optimized_OnHeapBB           1700104028981  avgt    5  17.252 ± 0.064  ns/op
ReadVLongBenchmark.readVLong_Optimized_OnHeapBB_Padded                9  avgt    5   4.680 ± 0.108  ns/op
ReadVLongBenchmark.readVLong_Optimized_OnHeapBB_Padded              512  avgt    5   9.719 ± 1.401  ns/op
ReadVLongBenchmark.readVLong_Optimized_OnHeapBB_Padded       2146483640  avgt    5   9.511 ± 0.219  ns/op
ReadVLongBenchmark.readVLong_Optimized_OnHeapBB_Padded    1700104028981  avgt    5   9.464 ± 0.019  ns/op
{noformat}

The benchmarks most relevant to how the HBase code will actually see a performance improvement are the readVLong_*_\{On|Off}HeapBB_Padded runs for 1700104028981, which is the vLong most similar to a memstoreTs being decoded from a block in the BlockCache (with any DBE).
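For context on how numbers like these can be produced, a stripped-down JMH harness in the same spirit might look like the sketch below. This is not the attached ReadVLongBenchmark code: the class name, iteration counts, and on-heap-only buffers are illustrative, and it assumes the existing ByteBufferUtils.readVLong(ByteBuffer) overload as the method under test plus Hadoop's WritableUtils.writeVLong to encode the parameter values.

{code:java}
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.hbase.util.ByteBufferUtils;
import org.apache.hadoop.io.WritableUtils;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

// Illustrative sketch only; the attached ReadVLongBenchmark also covers
// off-heap (direct) buffers and the optimized read path side by side.
@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 3)
@Measurement(iterations = 5)
@Fork(1)
public class ReadVLongSketchBenchmark {

  // Same vLong values as in the results above.
  @Param({ "9", "512", "2146483640", "1700104028981" })
  public long vint;

  private ByteBuffer unpadded; // sized exactly to the encoded vLong
  private ByteBuffer padded;   // at least 9 bytes, so an 8-byte-wide read is always safe

  @Setup
  public void setUp() {
    byte[] encoded = encodeVLong(vint);
    unpadded = ByteBuffer.wrap(encoded);
    // Pad with trailing zero bytes so the buffer is always >= 9 bytes long.
    padded = ByteBuffer.wrap(Arrays.copyOf(encoded, Math.max(9, encoded.length)));
  }

  @Benchmark
  public long readVLongUnpadded() {
    unpadded.position(0);
    return ByteBufferUtils.readVLong(unpadded);
  }

  @Benchmark
  public long readVLongPadded() {
    padded.position(0);
    return ByteBufferUtils.readVLong(padded);
  }

  // Encode with the standard Hadoop WritableUtils vLong layout.
  private static byte[] encodeVLong(long v) {
    try {
      ByteArrayOutputStream bos = new ByteArrayOutputStream();
      WritableUtils.writeVLong(new DataOutputStream(bos), v);
      return bos.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
{code}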
In terms of how this microbenchmark translates into better seek performance: I'm seeing a consistent 20% performance improvement in {{TestDataBlockEncoders}} with this patch versus without it.

> Enhance ByteBufferUtils.readVLong to read 8 bytes at a time
> -----------------------------------------------------------
>
>                 Key: HBASE-28256
>                 URL: https://issues.apache.org/jira/browse/HBASE-28256
>             Project: HBase
>          Issue Type: Improvement
>          Components: Performance
>            Reporter: Becker Ewing
>            Assignee: Becker Ewing
>            Priority: Major
>         Attachments: ReadVLongBenchmark.zip, async-prof-rs-cpu.html
>
>
> Currently, ByteBufferUtils.readVLong is used to decode rows in all data block
> encodings in order to read the memstoreTs field. For a data block encoding
> like prefix, ByteBufferUtils.readVLong can surprisingly occupy over 50% of
> the CPU time in BufferedEncodedSeeker.decodeNext (which can be quite a hot
> method in seek operations).
>
> Since memstoreTs will typically require at least 6 bytes to store, we could
> look to vectorize the read path for readVLong to read 8 bytes at a time
> instead of a single byte at a time (like in
> https://issues.apache.org/jira/browse/HBASE-28025) in order to increase
> performance.
>
> Attached is a CPU flamegraph of a region server process which shows that we
> spend a surprising amount of time in decoding rows from the DBE in
> ByteBufferUtils.readVLong.
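To make the "read 8 bytes at a time" idea in the quoted description concrete: the vLong layout involved is the Hadoop WritableUtils one (a first byte encoding length and sign, followed by up to 8 big-endian payload bytes), and the enhancement is to replace the per-byte loop with a single 8-byte read whenever enough bytes remain. Below is a rough sketch of that idea, not the actual patch: the class and helper names are mine, it assumes a big-endian ByteBuffer, and the real change would also have to cover HBase's ByteBuff/unsafe access paths.

{code:java}
import java.nio.ByteBuffer;

// Illustrative sketch only; not the HBASE-28256 patch.
public final class VLongReadSketch {

  // Today's approach: decode the Hadoop WritableUtils vLong layout one byte
  // at a time, the way ByteBufferUtils.readVLong currently does.
  static long readVLongByteAtATime(ByteBuffer in) {
    byte firstByte = in.get();
    int len = decodeVIntSize(firstByte);
    if (len == 1) {
      return firstByte; // single-byte vLongs hold the value directly
    }
    long i = 0;
    for (int idx = 0; idx < len - 1; idx++) {
      i = (i << 8) | (in.get() & 0xFF);
    }
    return isNegativeVLong(firstByte) ? ~i : i;
  }

  // The wide-read idea: when at least 8 bytes remain after the length byte
  // (i.e. the buffer is "padded" to 9+ bytes), fetch the payload with one
  // big-endian getLong and shift away the bytes that belong to later data.
  static long readVLongWide(ByteBuffer in) {
    byte firstByte = in.get();
    int len = decodeVIntSize(firstByte);
    if (len == 1) {
      return firstByte;
    }
    if (in.remaining() >= 8) {
      // Payload bytes are big-endian and sit in the high-order (len - 1)
      // bytes of the 8-byte word; the low-order bytes are discarded.
      long word = in.getLong(in.position()); // assumes BIG_ENDIAN buffer order
      long i = word >>> (8 * (9 - len));
      in.position(in.position() + (len - 1)); // consume only this vLong's bytes
      return isNegativeVLong(firstByte) ? ~i : i;
    }
    // Too close to the end of the buffer: fall back to the per-byte loop.
    long i = 0;
    for (int idx = 0; idx < len - 1; idx++) {
      i = (i << 8) | (in.get() & 0xFF);
    }
    return isNegativeVLong(firstByte) ? ~i : i;
  }

  // Same contract as org.apache.hadoop.io.WritableUtils.decodeVIntSize.
  static int decodeVIntSize(byte value) {
    if (value >= -112) {
      return 1;
    }
    return value < -120 ? (-119 - value) : (-111 - value);
  }

  // Sign flag for multi-byte vLongs (only meaningful when len > 1).
  static boolean isNegativeVLong(byte value) {
    return value < -120;
  }
}
{code}

The per-byte fallback is also why the padded benchmarks above are the interesting ones: an 8-byte-wide read is only safe when the buffer is guaranteed to have enough readable bytes past the vLong being decoded.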