Eric Jacobsen created CASSANDRA-15464:
-----------------------------------------
Summary: Inserts to set<text> slow due to AtomicBTreePartition for ComplexColumnData.dataSize
Key: CASSANDRA-15464
URL: https://issues.apache.org/jira/browse/CASSANDRA-15464
Project: Cassandra
Issue Type: Bug
Components: Legacy/Core
Reporter: Eric Jacobsen
Concurrent inserts to a set<text> column can cause client timeouts and excessive CPU due to the compare-and-swap loop in AtomicBTreePartition, which recomputes ComplexColumnData.dataSize on every attempt. As the set gets longer, each attempt takes longer and the probability that the compare-and-swap succeeds decreases.
The problem we saw in production was with insertions into a set<text> whose length ranged from hundreds to thousands of elements. Because of the semantics of what we store in the set, we had not anticipated the length being more than about 10. (Almost all rows have length <= 6; the largest observed was 7032. The total number of rows was < 4000, across 3 machines.)
The bad behavior we saw was that all machines went to 100% CPU on all cores, and clients were timing out. Our immediate mitigation in production was adding more machines (we went from 3 machines to 6 machines). The stack included partitions.AtomicBTreePartition.addAllWithSizeDelta … ComplexColumnData.dataSize.
The AtomicBTreePartition code uses a compare-and-swap approach, yet the time between compares grows with the length of the set. When the set is long and updates are concurrent, each loop iteration is unlikely to make forward progress, so threads can spend long periods spinning in the retry loop.
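The failure mode can be sketched with a small copy-on-write model (the class and names below are illustrative, not Cassandra's actual code): every attempt does O(n) work over the collection before a single compare-and-swap, so as n grows, concurrent writers increasingly collide and throw that work away.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical model of a CAS retry loop whose per-attempt cost is
// proportional to collection size, analogous to addAllWithSizeDelta
// recomputing ComplexColumnData.dataSize over every cell on each attempt.
public class CasContentionSketch {
    static final AtomicReference<List<String>> cells =
            new AtomicReference<>(new ArrayList<>());
    static final AtomicLong attempts = new AtomicLong();

    static void addElement(String e) {
        while (true) {
            attempts.incrementAndGet();
            List<String> current = cells.get();
            // O(n) work before the CAS: iterate every element, like a
            // dataSize computation. Longer sets widen the race window.
            long size = 0;
            for (String s : current) size += s.length();
            List<String> next = new ArrayList<>(current);
            next.add(e);
            // CAS fails if another writer published a new list meanwhile;
            // the whole O(n) pass above is then wasted and retried.
            if (cells.compareAndSet(current, next)) return;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int threads = 8, perThread = 200;
        Thread[] ts = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            ts[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) addElement("elem-" + id + "-" + i);
            });
            ts[t].start();
        }
        for (Thread th : ts) th.join();
        // attempts minus inserts is the number of wasted CAS retries.
        System.out.println("inserted=" + cells.get().size());
        System.out.println("attempts=" + attempts.get());
    }
}
```

The gap between `attempts` and the number of successful inserts measures wasted retries; in this model it widens as the list grows and as thread count rises.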
Here is one example call stack:
```
"SharedPool-Worker-40" #167 daemon prio=10 os_prio=0 tid=0x00007f9bb4032800 nid=0x2ee5 runnable [0x00007f9b067f4000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.cassandra.db.rows.ComplexColumnData.dataSize(ComplexColumnData.java:114)
	at org.apache.cassandra.db.rows.BTreeRow.dataSize(BTreeRow.java:373)
	at org.apache.cassandra.db.partitions.AtomicBTreePartition$RowUpdater.apply(AtomicBTreePartition.java:292)
	at org.apache.cassandra.db.partitions.AtomicBTreePartition$RowUpdater.apply(AtomicBTreePartition.java:235)
	at org.apache.cassandra.utils.btree.NodeBuilder.update(NodeBuilder.java:159)
	at org.apache.cassandra.utils.btree.TreeBuilder.update(TreeBuilder.java:73)
	at org.apache.cassandra.utils.btree.BTree.update(BTree.java:181)
	at org.apache.cassandra.db.partitions.AtomicBTreePartition.addAllWithSizeDelta(AtomicBTreePartition.java:155)
	at org.apache.cassandra.db.Memtable.put(Memtable.java:254)
	at org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1204)
	at org.apache.cassandra.db.Keyspace.applyInternal(Keyspace.java:573)
	at org.apache.cassandra.db.Keyspace.applyFuture(Keyspace.java:384)
	at org.apache.cassandra.db.Mutation.applyFuture(Mutation.java:205)
	at org.apache.cassandra.hints.Hint.applyFuture(Hint.java:99)
	at org.apache.cassandra.hints.HintVerbHandler.doVerb(HintVerbHandler.java:95)
	at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:67)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164)
	at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:136)
	at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105)
	at java.lang.Thread.run(Thread.java:748)
```
In a test program to reproduce the problem, we raise the number of concurrent users and lower the think time between queries. Updates to elements of short sets complete without errors; with long sets, clients time out with errors, there are periods where all cores sit at 99.x% CPU, and jstack shows the time going to ComplexColumnData.dataSize.
Here is the schema. Our long-term application solution was to make the set elements part of the primary key and avoid set<text> entirely, thus guaranteeing the code does not go through ComplexColumnData.dataSize.
```
CREATE TABLE x.x (
    x int PRIMARY KEY,
    y set<text>
) ...
```
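The workaround described above might look like the following sketch (table and column names are illustrative, not our actual schema): each former set element becomes a clustering column, so every insert is an ordinary row write that never touches ComplexColumnData.dataSize.

```
CREATE TABLE x.x (
    x int,
    y_element text,
    PRIMARY KEY (x, y_element)
)
```

Reading all elements for a partition key is then a range scan over the clustering column rather than a single-cell collection read.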
--
This message was sent by Atlassian Jira
(v8.3.4#803005)