[ https://issues.apache.org/jira/browse/HBASE-28256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17795791#comment-17795791 ]
Becker Ewing commented on HBASE-28256:
--------------------------------------

I wrote a JMH benchmarking suite (and attached the code to this JIRA) that measures the performance of reading various vLongs from both on-heap and off-heap buffers, with and without padding (padding meaning the buffer is at least 9 bytes long, so the vectorized read path can be applied). That is to say, padding should make no difference for the unoptimized readVLong path, but with the optimized method only the padded benchmarks should show a substantial performance improvement (and they do).

{noformat}
Benchmark                                                        (vint)  Mode  Cnt   Score   Error  Units
ReadVLongBenchmark.readVLong_OffHeapBB                                9  avgt    5   4.643 ± 2.917  ns/op
ReadVLongBenchmark.readVLong_OffHeapBB                              512  avgt    5   8.063 ± 0.187  ns/op
ReadVLongBenchmark.readVLong_OffHeapBB                       2146483640  avgt    5  11.999 ± 0.314  ns/op
ReadVLongBenchmark.readVLong_OffHeapBB                    1700104028981  avgt    5  14.880 ± 0.698  ns/op
ReadVLongBenchmark.readVLong_OffHeapBB_Padded                         9  avgt    5   4.233 ± 0.136  ns/op
ReadVLongBenchmark.readVLong_OffHeapBB_Padded                       512  avgt    5   7.986 ± 0.048  ns/op
ReadVLongBenchmark.readVLong_OffHeapBB_Padded                2146483640  avgt    5  12.014 ± 0.012  ns/op
ReadVLongBenchmark.readVLong_OffHeapBB_Padded             1700104028981  avgt    5  14.655 ± 2.216  ns/op
ReadVLongBenchmark.readVLong_OnHeapBB                                 9  avgt    5   4.639 ± 0.012  ns/op
ReadVLongBenchmark.readVLong_OnHeapBB                               512  avgt    5   9.771 ± 0.357  ns/op
ReadVLongBenchmark.readVLong_OnHeapBB                        2146483640  avgt    5  13.928 ± 0.557  ns/op
ReadVLongBenchmark.readVLong_OnHeapBB                     1700104028981  avgt    5  17.487 ± 4.527  ns/op
ReadVLongBenchmark.readVLong_OnHeapBB_Padded                          9  avgt    5   5.245 ± 0.019  ns/op
ReadVLongBenchmark.readVLong_OnHeapBB_Padded                        512  avgt    5  10.086 ± 0.317  ns/op
ReadVLongBenchmark.readVLong_OnHeapBB_Padded                 2146483640  avgt    5  13.764 ± 0.100  ns/op
ReadVLongBenchmark.readVLong_OnHeapBB_Padded              1700104028981  avgt    5  17.200 ± 0.913  ns/op
ReadVLongBenchmark.readVLong_Optimized_OffHeapBB                      9  avgt    5   4.258 ± 0.012  ns/op
ReadVLongBenchmark.readVLong_Optimized_OffHeapBB                    512  avgt    5   8.621 ± 0.339  ns/op
ReadVLongBenchmark.readVLong_Optimized_OffHeapBB             2146483640  avgt    5  12.481 ± 2.609  ns/op
ReadVLongBenchmark.readVLong_Optimized_OffHeapBB          1700104028981  avgt    5  14.211 ± 0.041  ns/op
ReadVLongBenchmark.readVLong_Optimized_OffHeapBB_Padded               9  avgt    5   4.222 ± 0.007  ns/op
ReadVLongBenchmark.readVLong_Optimized_OffHeapBB_Padded             512  avgt    5   8.830 ± 0.022  ns/op
ReadVLongBenchmark.readVLong_Optimized_OffHeapBB_Padded      2146483640  avgt    5   8.998 ± 1.280  ns/op
ReadVLongBenchmark.readVLong_Optimized_OffHeapBB_Padded   1700104028981  avgt    5   8.850 ± 0.047  ns/op
ReadVLongBenchmark.readVLong_Optimized_OnHeapBB                       9  avgt    5   4.751 ± 0.732  ns/op
ReadVLongBenchmark.readVLong_Optimized_OnHeapBB                     512  avgt    5  10.575 ± 0.024  ns/op
ReadVLongBenchmark.readVLong_Optimized_OnHeapBB              2146483640  avgt    5  14.231 ± 0.385  ns/op
ReadVLongBenchmark.readVLong_Optimized_OnHeapBB           1700104028981  avgt    5  17.252 ± 0.064  ns/op
ReadVLongBenchmark.readVLong_Optimized_OnHeapBB_Padded                9  avgt    5   4.680 ± 0.108  ns/op
ReadVLongBenchmark.readVLong_Optimized_OnHeapBB_Padded              512  avgt    5   9.719 ± 1.401  ns/op
ReadVLongBenchmark.readVLong_Optimized_OnHeapBB_Padded       2146483640  avgt    5   9.511 ± 0.219  ns/op
ReadVLongBenchmark.readVLong_Optimized_OnHeapBB_Padded    1700104028981  avgt    5   9.464 ± 0.019  ns/op
{noformat}

The benchmarks most relevant to how the HBase code will actually see a performance improvement are the readVLong_*_\{On|Off}HeapBB_Padded runs for 1700104028981, which is the vLong most similar to a memstoreTs being decoded from a block in the BlockCache (with any DBE).
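For context on how numbers like these can be produced, a stripped-down JMH harness in the same spirit might look like the sketch below. This is not the attached ReadVLongBenchmark code: the class name, iteration counts, and on-heap-only buffers are illustrative, and it assumes the existing ByteBufferUtils.readVLong(ByteBuffer) overload as the method under test plus Hadoop's WritableUtils.writeVLong to encode the parameter values.

{code:java}
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.hbase.util.ByteBufferUtils;
import org.apache.hadoop.io.WritableUtils;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

// Illustrative sketch only; the attached ReadVLongBenchmark also covers
// off-heap (direct) buffers and the optimized read path side by side.
@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 3)
@Measurement(iterations = 5)
@Fork(1)
public class ReadVLongSketchBenchmark {

  // Same vLong values as in the results above.
  @Param({ "9", "512", "2146483640", "1700104028981" })
  public long vint;

  private ByteBuffer unpadded; // sized exactly to the encoded vLong
  private ByteBuffer padded;   // at least 9 bytes, so an 8-byte-wide read is always safe

  @Setup
  public void setUp() {
    byte[] encoded = encodeVLong(vint);
    unpadded = ByteBuffer.wrap(encoded);
    // Pad with trailing zero bytes so the buffer is always >= 9 bytes long.
    padded = ByteBuffer.wrap(Arrays.copyOf(encoded, Math.max(9, encoded.length)));
  }

  @Benchmark
  public long readVLongUnpadded() {
    unpadded.position(0);
    return ByteBufferUtils.readVLong(unpadded);
  }

  @Benchmark
  public long readVLongPadded() {
    padded.position(0);
    return ByteBufferUtils.readVLong(padded);
  }

  // Encode with the standard Hadoop WritableUtils vLong layout.
  private static byte[] encodeVLong(long v) {
    try {
      ByteArrayOutputStream bos = new ByteArrayOutputStream();
      WritableUtils.writeVLong(new DataOutputStream(bos), v);
      return bos.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
{code}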
In terms of how this microbenchmark translates into better seek performance: I'm seeing a consistent 20% performance improvement in {{TestDataBlockEncoders}} with this patch versus without it.

> Enhance ByteBufferUtils.readVLong to read 8 bytes at a time
> -----------------------------------------------------------
>
>                 Key: HBASE-28256
>                 URL: https://issues.apache.org/jira/browse/HBASE-28256
>             Project: HBase
>          Issue Type: Improvement
>          Components: Performance
>            Reporter: Becker Ewing
>            Assignee: Becker Ewing
>            Priority: Major
>         Attachments: ReadVLongBenchmark.zip, async-prof-rs-cpu.html
>
>
> Currently, ByteBufferUtils.readVLong is used to decode rows in all data block
> encodings in order to read the memstoreTs field. For a data block encoding
> like prefix, ByteBufferUtils.readVLong can surprisingly occupy over 50% of
> the CPU time in BufferedEncodedSeeker.decodeNext (which can be quite a hot
> method in seek operations).
>
> Since memstoreTs will typically require at least 6 bytes to store, we could
> look to vectorize the read path for readVLong to read 8 bytes at a time
> instead of a single byte at a time (like in
> https://issues.apache.org/jira/browse/HBASE-28025) in order to increase
> performance.
>
> Attached is a CPU flamegraph of a region server process which shows that we
> spend a surprising amount of time in decoding rows from the DBE in
> ByteBufferUtils.readVLong.
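To make the "read 8 bytes at a time" idea in the quoted description concrete: the vLong layout involved is the Hadoop WritableUtils one (a first byte encoding length and sign, followed by up to 8 big-endian payload bytes), and the enhancement is to replace the per-byte loop with a single 8-byte read whenever enough bytes remain. Below is a rough sketch of that idea, not the actual patch: the class and helper names are mine, it assumes a big-endian ByteBuffer, and the real change would also have to cover HBase's ByteBuff/unsafe access paths.

{code:java}
import java.nio.ByteBuffer;

// Illustrative sketch only; not the HBASE-28256 patch.
public final class VLongReadSketch {

  // Today's approach: decode the Hadoop WritableUtils vLong layout one byte
  // at a time, the way ByteBufferUtils.readVLong currently does.
  static long readVLongByteAtATime(ByteBuffer in) {
    byte firstByte = in.get();
    int len = decodeVIntSize(firstByte);
    if (len == 1) {
      return firstByte; // single-byte vLongs hold the value directly
    }
    long i = 0;
    for (int idx = 0; idx < len - 1; idx++) {
      i = (i << 8) | (in.get() & 0xFF);
    }
    return isNegativeVLong(firstByte) ? ~i : i;
  }

  // The wide-read idea: when at least 8 bytes remain after the length byte
  // (i.e. the buffer is "padded" to 9+ bytes), fetch the payload with one
  // big-endian getLong and shift away the bytes that belong to later data.
  static long readVLongWide(ByteBuffer in) {
    byte firstByte = in.get();
    int len = decodeVIntSize(firstByte);
    if (len == 1) {
      return firstByte;
    }
    if (in.remaining() >= 8) {
      // Payload bytes are big-endian and sit in the high-order (len - 1)
      // bytes of the 8-byte word; the low-order bytes are discarded.
      long word = in.getLong(in.position()); // assumes BIG_ENDIAN buffer order
      long i = word >>> (8 * (9 - len));
      in.position(in.position() + (len - 1)); // consume only this vLong's bytes
      return isNegativeVLong(firstByte) ? ~i : i;
    }
    // Too close to the end of the buffer: fall back to the per-byte loop.
    long i = 0;
    for (int idx = 0; idx < len - 1; idx++) {
      i = (i << 8) | (in.get() & 0xFF);
    }
    return isNegativeVLong(firstByte) ? ~i : i;
  }

  // Same contract as org.apache.hadoop.io.WritableUtils.decodeVIntSize.
  static int decodeVIntSize(byte value) {
    if (value >= -112) {
      return 1;
    }
    return value < -120 ? (-119 - value) : (-111 - value);
  }

  // Sign flag for multi-byte vLongs (only meaningful when len > 1).
  static boolean isNegativeVLong(byte value) {
    return value < -120;
  }
}
{code}

The per-byte fallback is also why the padded benchmarks above are the interesting ones: an 8-byte-wide read is only safe when the buffer is guaranteed to have enough readable bytes past the vLong being decoded.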