Alexey Serbin has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/21499


Change subject: [ARM] Concurrent binary tree memory barriers fixed.
......................................................................

[ARM] Concurrent binary tree memory barriers fixed.

TestCBTree.TestRacyConcurrentInsert sometimes fails on ARM.
The concurrent inserts produce a tree with some incorrectly ordered
nodes. Apart from incorrect order, there are no other errors in the
tree. All inserted elements are present, but CBTree::GetCopy cannot
find some due to the incorrect ordering.

This is not a unit test error, but a real error in the CBTree
implementation. Modifying the test to only do the inserts first, and
only start checking when the tree is finalized does not prevent the bug.
So calling only the tree->Insert function from multiple threads is
sufficient to reproduce the issue (only on ARM).

Root cause:

Some memory order restrictions are not strict enough. It does not cause
a problem on x86 (There were no real changes for 11 years), but it
causes problems on ARM. x86 guarantees a globally consistent order for
stores (TSO).
ARM, in contrast, allows stores to different memory locations to be
observed differently from the program order.
More info:
https://www.sciencedirect.com/science/article/pii/S1383762124000390

Solution:

The following barriers need to be more strict:

1. When we set the Splitting/Inserting flag on a node, then it is not
  enough to flush all previous changes. It is also very important not
  to reorder any write before it. So instead of Release_Store (which
  is practically equivalent to std::memory_order_release), we need a
  2 way barrier.

2. When we call AcquireVersion or StableVersion to verify that the
  version was not changed during our (already completed) read, a 2-way
  std::memory_order_acquire is required.

Putting the appropriate std::atomic_thread_fence(...) calls to these
places would resolve the issue. However, replacing the current function
from atomicops-internals-arm64.h to the C++ standard functions will
make the code more consistent.

Reason for changing to std::atomics:

atomicops-internals-arm64.h was picked from chromium source and the
missing functions were reimplemented. The header is gone since, and even
chromium uses a solution based on std::atomic (see
base/atomicops_internals_portable.h in chromium source). I see no reason
to update the header from chromium and implement the missing functions,
just to have one more abstraction layer, that is basically just
"function aliases" at this point.

Nodes are PACKED, which conflicts with std::atomic. However, when we
allocate a node, it is allocated with sufficient alignment for atomic
operations. Other unaligned structures (the values of the data) are
stored between nodes. Removing PACKED would increase the memory
footprint and actually slow things down slightly.

Notes:
+ std::atomic::<operation> usually blocks reordering in one direction.
  std::atomic_thread_fence(...) blocks reordering in both directions.

- std::assume_aligned is from C++20. The used __builtin_assume_aligned
  should be supported by both Clang and GCC.

- I renamed `IsLocked` to `IsLockedUnsafe` because it should only be
  used by the thread that actually holds the lock (as it is currently
  only used in `DCHECK` macros when we hold the lock).

Performance Tests on an aws c6i.2xlarge :

=============== master =============

 Performance counter stats for 'bin/cbtree-test 
--gtest_filter=*TestRacyConcurrentInsert* --gtest_repeat=10':

        1033864.95 msec task-clock                #    7.703 CPUs utilized
            788348      context-switches          #    0.763 K/sec
            142572      cpu-migrations            #    0.138 K/sec
              3853      page-faults               #    0.004 K/sec
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses

     134.221318612 seconds time elapsed

    1036.566723000 seconds user
       0.369948000 seconds sys

 Performance counter stats for 'bin/cbtree-test 
--gtest_filter=*ConcurrentReadWriteBenchmark* --gtest_repeat=10 
--concurrent_rw_benchmark_num_writer_threads=4 
--concurrent_rw_benchmark_num_reader_threads=4':

         175292.55 msec task-clock                #    5.723 CPUs utilized
             35287      context-switches          #    0.201 K/sec
              1209      cpu-migrations            #    0.007 K/sec
           1990806      page-faults               #    0.011 M/sec
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses

      30.627830389 seconds time elapsed

     172.712266000 seconds user
       2.846389000 seconds sys

 Performance counter stats for 'bin/cbtree-test 
--gtest_filter=*ConcurrentReadWriteBenchmark* --gtest_repeat=10 
--concurrent_rw_benchmark_num_writer_threads=16 
--concurrent_rw_benchmark_num_reader_threads=16':

         510151.64 msec task-clock                #    7.777 CPUs utilized
            174800      context-switches          #    0.343 K/sec
             18371      cpu-migrations            #    0.036 K/sec
           2419870      page-faults               #    0.005 M/sec
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses

      65.595733846 seconds time elapsed

     507.753702000 seconds user
       2.711470000 seconds sys

 Performance counter stats for 'bin/cbtree-test 
--gtest_filter=*ConcurrentReadWriteBenchmark* --gtest_repeat=10 
--concurrent_rw_benchmark_num_writer_threads=4 
--concurrent_rw_benchmark_num_reader_threads=16 
--concurrent_rw_benchmark_reader_boost=20':

        1382976.80 msec task-clock                #    7.493 CPUs utilized
            113151      context-switches          #    0.082 K/sec
              8956      cpu-migrations            #    0.006 K/sec
           1719478      page-faults               #    0.001 M/sec
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses

     184.561325426 seconds time elapsed

    1380.544067000 seconds user
       2.527816000 seconds sys

 Performance counter stats for 'bin/cbtree-test 
--gtest_filter=*TestScanPerformance* --gtest_repeat=10':

          46673.48 msec task-clock                #    1.000 CPUs utilized
                82      context-switches          #    0.002 K/sec
                 0      cpu-migrations            #    0.000 K/sec
             82378      page-faults               #    0.002 M/sec
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses

      46.692262861 seconds time elapsed

      46.585368000 seconds user
       0.088011000 seconds sys

 Performance counter stats for 'bin/memrowset-test 
--gtest_filter=*TestMemRowSetUpdatePerformance* --gtest_repeat=10':

          15982.29 msec task-clock                #    0.999 CPUs utilized
                67      context-switches          #    0.004 K/sec
                 0      cpu-migrations            #    0.000 K/sec
             73407      page-faults               #    0.005 M/sec
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses

      16.006175617 seconds time elapsed

      15.842251000 seconds user
       0.139984000 seconds sys

 Performance counter stats for 'bin/deltamemstore-test 
--gtest_filter=*BenchmarkManyUpdatesToOneRow* --gtest_repeat=10':

            499.92 msec task-clock                #    0.963 CPUs utilized
                 1      context-switches          #    0.002 K/sec
                 0      cpu-migrations            #    0.000 K/sec
              4555      page-faults               #    0.009 M/sec
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses

       0.519157362 seconds time elapsed

       0.492121000 seconds user
       0.008001000 seconds sys

 Performance counter stats for 'bin/deltamemstore-test 
--gtest_filter=*BenchmarkSnapshotScans* --gtest_repeat=10':

         110480.15 msec task-clock                #    1.000 CPUs utilized
               158      context-switches          #    0.001 K/sec
                 0      cpu-migrations            #    0.000 K/sec
              9729      page-faults               #    0.088 K/sec
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses

     110.482194394 seconds time elapsed

     110.395878000 seconds user
       0.084000000 seconds sys

 Performance counter stats for 'bin/deltamemstore-test 
--gtest_filter=*BenchmarkScansWithVaryingNumberOfDeletes* --gtest_repeat=10':

           2909.76 msec task-clock                #    0.994 CPUs utilized
                 9      context-switches          #    0.003 K/sec
                 0      cpu-migrations            #    0.000 K/sec
              4191      page-faults               #    0.001 M/sec
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses

       2.926834368 seconds time elapsed

       2.893893000 seconds user
       0.016010000 seconds sys

==== Soltuion ====

 Performance counter stats for 'bin/cbtree-test 
--gtest_filter=*TestRacyConcurrentInsert* --gtest_repeat=10':

        1018513.29 msec task-clock                #    7.705 CPUs utilized
            812110      context-switches          #    0.797 K/sec
            137227      cpu-migrations            #    0.135 K/sec
              3730      page-faults               #    0.004 K/sec
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses

     132.189348914 seconds time elapsed

    1020.937652000 seconds user
       0.536985000 seconds sys

 Performance counter stats for 'bin/cbtree-test 
--gtest_filter=*ConcurrentReadWriteBenchmark* --gtest_repeat=10 
--concurrent_rw_benchmark_num_writer_threads=4 
--concurrent_rw_benchmark_num_reader_threads=4':

         174832.94 msec task-clock                #    5.713 CPUs utilized
             37458      context-switches          #    0.214 K/sec
              1191      cpu-migrations            #    0.007 K/sec
           1946382      page-faults               #    0.011 M/sec
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses

      30.601371311 seconds time elapsed

     172.509285000 seconds user
       2.588044000 seconds sys

 Performance counter stats for 'bin/cbtree-test 
--gtest_filter=*ConcurrentReadWriteBenchmark* --gtest_repeat=10 
--concurrent_rw_benchmark_num_writer_threads=16 
--concurrent_rw_benchmark_num_reader_threads=16':

         510441.49 msec task-clock                #    7.776 CPUs utilized
            169054      context-switches          #    0.331 K/sec
             17615      cpu-migrations            #    0.035 K/sec
           2281144      page-faults               #    0.004 M/sec
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses

      65.645440680 seconds time elapsed

     508.128974000 seconds user
       2.584854000 seconds sys

 Performance counter stats for 'bin/cbtree-test 
--gtest_filter=*ConcurrentReadWriteBenchmark* --gtest_repeat=10 
--concurrent_rw_benchmark_num_writer_threads=4 
--concurrent_rw_benchmark_num_reader_threads=16 
--concurrent_rw_benchmark_reader_boost=20':

        1385197.11 msec task-clock                #    7.527 CPUs utilized
            107446      context-switches          #    0.078 K/sec
              8889      cpu-migrations            #    0.006 K/sec
           1479917      page-faults               #    0.001 M/sec
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses

     184.042601788 seconds time elapsed

    1383.208541000 seconds user
       2.100502000 seconds sys

 Performance counter stats for 'bin/cbtree-test 
--gtest_filter=*TestScanPerformance* --gtest_repeat=10':

          46523.62 msec task-clock                #    1.000 CPUs utilized
                78      context-switches          #    0.002 K/sec
                 0      cpu-migrations            #    0.000 K/sec
             81874      page-faults               #    0.002 M/sec
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses

      46.542133217 seconds time elapsed

      46.387550000 seconds user
       0.135999000 seconds sys

 Performance counter stats for 'bin/memrowset-test 
--gtest_filter=*TestMemRowSetUpdatePerformance* --gtest_repeat=10':

          16507.47 msec task-clock                #    0.999 CPUs utilized
                35      context-switches          #    0.002 K/sec
                 0      cpu-migrations            #    0.000 K/sec
             73399      page-faults               #    0.004 M/sec
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses

      16.525400780 seconds time elapsed

      16.399518000 seconds user
       0.108023000 seconds sys

 Performance counter stats for 'bin/deltamemstore-test 
--gtest_filter=*BenchmarkManyUpdatesToOneRow* --gtest_repeat=10':

            455.59 msec task-clock                #    0.963 CPUs utilized
                 3      context-switches          #    0.007 K/sec
                 1      cpu-migrations            #    0.002 K/sec
              4545      page-faults               #    0.010 M/sec
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses

       0.473311020 seconds time elapsed

       0.451741000 seconds user
       0.004033000 seconds sys

 Performance counter stats for 'bin/deltamemstore-test 
--gtest_filter=*BenchmarkSnapshotScans* --gtest_repeat=10':

         100909.53 msec task-clock                #    1.000 CPUs utilized
               361      context-switches          #    0.004 K/sec
                 0      cpu-migrations            #    0.000 K/sec
              9727      page-faults               #    0.096 K/sec
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses

     100.915394127 seconds time elapsed

     100.872796000 seconds user
       0.035998000 seconds sys

 Performance counter stats for 'bin/deltamemstore-test 
--gtest_filter=*BenchmarkScansWithVaryingNumberOfDeletes* --gtest_repeat=10':

           2686.32 msec task-clock                #    0.993 CPUs utilized
                 6      context-switches          #    0.002 K/sec
                 0      cpu-migrations            #    0.000 K/sec
              4191      page-faults               #    0.002 M/sec
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses

       2.706225727 seconds time elapsed

       2.662467000 seconds user
       0.023986000 seconds sys

Change-Id: Ie7c02dc3b9444599ef680fa9b29ae7a4d39dd382
Reviewed-on: http://gerrit.cloudera.org:8080/21127
Tested-by: Kudu Jenkins
Reviewed-by: Alexey Serbin <[email protected]>
(cherry picked from commit 2161950da43f9daa71f2297683e60e3ff7fcd1db)
---
M src/kudu/tablet/concurrent_btree.h
1 file changed, 108 insertions(+), 72 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/99/21499/1
--
To view, visit http://gerrit.cloudera.org:8080/21499
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: branch-1.17.x
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ie7c02dc3b9444599ef680fa9b29ae7a4d39dd382
Gerrit-Change-Number: 21499
Gerrit-PatchSet: 1
Gerrit-Owner: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Zoltan Martonka <[email protected]>

Reply via email to