Hello Tidy Bot, Zoltan Chovan, Alexey Serbin, Ashwani Raina, Kudu Jenkins, 
Anonymous Coward (763),

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/21127

to look at the new patch set (#14).

Change subject: [ARM] Concurrent binary tree memory barriers fixed.
......................................................................

[ARM] Concurrent binary tree memory barriers fixed.

TestCBTree.TestRacyConcurrentInsert sometimes fails on ARM.
The concurrent inserts produce a tree with some incorrectly ordered
nodes. Apart from incorrect order, there are no other errors in the
tree. All inserted elements are present, but CBTree::GetCopy cannot
find some due to the incorrect ordering.

This is not a unit test error, but a real error in the CBTree
implementation. Modifying the test to only do the inserts first, and
only start checking when the tree is finalized does not prevent the bug.
So calling only the tree->Insert function from multiple threads is
sufficient to reproduce the issue (only on ARM).

Root cause:

Some memory order restrictions are not strict enough. It does not cause
a problem on x86 (There were no real changes for 11 years), but it
causes problems on ARM. x86 guarantees a globally consistent order for
stores (TSO).
ARM, in contrast, allows stores to different memory locations to be
observed differently from the program order.
More info:
https://www.sciencedirect.com/science/article/pii/S1383762124000390

Solution:

The following barriers need to be more strict:

1. When we set the Splitting/Inserting flag on a node, then it is not
  enough to flush all previous changes. It is also very important not
  to reorder any write before it. So instead of Release_Store (which
  is practically equivalent to std::memory_order_release), we need a
  2 way barrier.

2. When we call AcquireVersion or StableVersion to verify that the
version
  was not changed during our (already completed) read, a 2-way
  std::memory_order_acquire is required.

Putting the appropriate std::atomic_thread_fence(...) calls to these
places would resolve the issue. However, replacing the current function
from atomicops-internals-arm64.h to the c++ standard functions will
make the code more consistent.

Reason for changing to std::atomics:

atomicops-internals-arm64.h was picked from chromium source and the
missing functions was reimplemented. The header is gone since, and even
chromium uses a solution based on std::atomic (see
base/atomicops_internals_portable.h in chromium source). I see no reason
to update the header from chromium and implement the missing functions,
just to have one more abstraction layer, that is basically just
"function aliases" at this point.

Nodes are PACKED, which conflicts with std::atomic. However, when we
allocate a node, it is allocated with sufficient alignment for atomic
operations. Other unaligned structures (the values of the data) are
stored between nodes. Removing PACKED would increase the memory
footprint and actually slow things down slightly.

Notes:
+ std::atomic::<operation> usually blocks reordering in one direction.
  std::atomic_thread_fence(...) blocks reordering in both directions.

- std::assume_aligned is from C++20. The used __builtin_assume_aligned
  should be supported by both Clang and GCC.

- I renamed `IsLocked` to `IsLockedUnsafe` because it should only be
  used by the thread that actually holds the lock (as it is currently
  only used in `DCHECK` macros when we hold the lock).

Performance Tests on an aws c6i.xlarge :

=============== master =============

 Performance counter stats for 'bin/cbtree-test 
--gtest_filter=*TestScanPerformance* --gtest_repeat=10':

          45832.36 msec task-clock                #    1.000 CPUs utilized
               165      context-switches          #    0.004 K/sec
                 0      cpu-migrations            #    0.000 K/sec
             81868      page-faults               #    0.002 M/sec
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses

      45.853664150 seconds time elapsed

      45.736053000 seconds user
       0.096008000 seconds sys

 Performance counter stats for 'bin/memrowset-test 
--gtest_filter=*TestMemRowSetUpdatePerformance* --gtest_repeat=10':

          15873.74 msec task-clock                #    0.999 CPUs utilized
                68      context-switches          #    0.004 K/sec
                 0      cpu-migrations            #    0.000 K/sec
             73408      page-faults               #    0.005 M/sec
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses

      15.893188554 seconds time elapsed

      15.745611000 seconds user
       0.128073000 seconds sys

 Performance counter stats for 'bin/deltamemstore-test 
--gtest_filter=*BenchmarkManyUpdatesToOneRow* --gtest_repeat=10':

            471.49 msec task-clock                #    0.960 CPUs utilized
                 3      context-switches          #    0.006 K/sec
                 0      cpu-migrations            #    0.000 K/sec
              4555      page-faults               #    0.010 M/sec
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses

       0.491115349 seconds time elapsed

       0.467662000 seconds user
       0.003999000 seconds sys

 Performance counter stats for 'bin/deltamemstore-test 
--gtest_filter=*BenchmarkSnapshotScans* --gtest_repeat=10':

         101853.33 msec task-clock                #    1.000 CPUs utilized
               210      context-switches          #    0.002 K/sec
                 2      cpu-migrations            #    0.000 K/sec
              9730      page-faults               #    0.096 K/sec
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses

     101.856694431 seconds time elapsed

     101.824828000 seconds user
       0.028001000 seconds sys

 Performance counter stats for 'bin/deltamemstore-test 
--gtest_filter=*BenchmarkScansWithVaryingNumberOfDeletes* --gtest_repeat=10':

           2639.61 msec task-clock                #    0.993 CPUs utilized
                 8      context-switches          #    0.003 K/sec
                 1      cpu-migrations            #    0.000 K/sec
              4194      page-faults               #    0.002 M/sec
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses

       2.659094874 seconds time elapsed

       2.635764000 seconds user
       0.004000000 seconds sys

==== Soltuion ====

 Performance counter stats for 'bin/cbtree-test 
--gtest_filter=*TestScanPerformance* --gtest_repeat=10':

          46058.00 msec task-clock                #    1.000 CPUs utilized
               138      context-switches          #    0.003 K/sec
                 0      cpu-migrations            #    0.000 K/sec
             81865      page-faults               #    0.002 M/sec
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses

      46.075198730 seconds time elapsed

      45.925709000 seconds user
       0.132006000 seconds sys

 Performance counter stats for 'bin/memrowset-test 
--gtest_filter=*TestMemRowSetUpdatePerformance* --gtest_repeat=10':

          15947.21 msec task-clock                #    0.999 CPUs utilized
                59      context-switches          #    0.004 K/sec
                 0      cpu-migrations            #    0.000 K/sec
             73403      page-faults               #    0.005 M/sec
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses

      15.969573129 seconds time elapsed

      15.851203000 seconds user
       0.096019000 seconds sys

 Performance counter stats for 'bin/deltamemstore-test 
--gtest_filter=*BenchmarkManyUpdatesToOneRow* --gtest_repeat=10':

            463.68 msec task-clock                #    0.961 CPUs utilized
                 2      context-switches          #    0.004 K/sec
                 0      cpu-migrations            #    0.000 K/sec
              4556      page-faults               #    0.010 M/sec
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses

       0.482432277 seconds time elapsed

       0.455777000 seconds user
       0.008066000 seconds sys

 Performance counter stats for 'bin/deltamemstore-test 
--gtest_filter=*BenchmarkSnapshotScans* --gtest_repeat=10':

          97549.62 msec task-clock                #    1.000 CPUs utilized
               260      context-switches          #    0.003 K/sec
                 1      cpu-migrations            #    0.000 K/sec
              9727      page-faults               #    0.100 K/sec
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses

      97.553517252 seconds time elapsed

      97.485083000 seconds user
       0.063995000 seconds sys

 Performance counter stats for 'bin/deltamemstore-test 
--gtest_filter=*BenchmarkScansWithVaryingNumberOfDeletes* --gtest_repeat=10':

           2592.69 msec task-clock                #    0.993 CPUs utilized
                 7      context-switches          #    0.003 K/sec
                 0      cpu-migrations            #    0.000 K/sec
              4196      page-faults               #    0.002 M/sec
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses

       2.610873563 seconds time elapsed

       2.580822000 seconds user
       0.012003000 seconds sys

Change-Id: Ie7c02dc3b9444599ef680fa9b29ae7a4d39dd382
---
M src/kudu/tablet/concurrent_btree.h
1 file changed, 108 insertions(+), 72 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/27/21127/14
--
To view, visit http://gerrit.cloudera.org:8080/21127
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie7c02dc3b9444599ef680fa9b29ae7a4d39dd382
Gerrit-Change-Number: 21127
Gerrit-PatchSet: 14
Gerrit-Owner: Zoltan Martonka <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Anonymous Coward (763)
Gerrit-Reviewer: Ashwani Raina <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Zoltan Chovan <[email protected]>
Gerrit-Reviewer: Zoltan Martonka <[email protected]>

Reply via email to