[ 
https://issues.apache.org/jira/browse/CASSANDRA-11340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15233214#comment-15233214
 ] 

Russ Hatch commented on CASSANDRA-11340:
----------------------------------------

I've been attempting to reproduce this issue, but haven't had any success so 
far. Have tried 20 and 60 node clusters (ec2).

Using PasswordAuthenticator, CassandraAuthorizer, various values for 
permissions_validity_in_ms. I've managed to get nodes up to several hundred 
connections, but cpu utilization always recovers afterwards. This was with 
short duration connections, so perhaps this has something to do with connection 
duration as well.

I'll attach my python script I was using to generate connections and create a 
small amount of writes.

> Heavy read activity on system_auth tables can cause apparent livelock
> ---------------------------------------------------------------------
>
>                 Key: CASSANDRA-11340
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11340
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jeff Jirsa
>            Assignee: Aleksey Yeschenko
>
> Reproduced in at least 2.1.9. 
> It appears possible for queries against system_auth tables to trigger 
> speculative retry, which causes auth to block on traffic going off node. In 
> some cases, it appears possible for threads to become deadlocked, causing 
> load on the nodes to increase sharply. This happens even in clusters with RF 
> of system_auth == N, as all requests being served locally puts the bar for 
> 99% SR pretty low. 
> Incomplete stack trace below, but we haven't yet figured out what exactly is 
> blocking:
> {code}
> Thread 82291: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information 
> may be imprecise)
>  - java.util.concurrent.locks.LockSupport.parkNanos(long) @bci=11, line=338 
> (Compiled frame)
>  - 
> org.apache.cassandra.utils.concurrent.WaitQueue$AbstractSignal.awaitUntil(long)
>  @bci=28, line=307 (Compiled frame)
>  - org.apache.cassandra.utils.concurrent.SimpleCondition.await(long, 
> java.util.concurrent.TimeUnit) @bci=76, line=63 (Compiled frame)
>  - org.apache.cassandra.service.ReadCallback.await(long, 
> java.util.concurrent.TimeUnit) @bci=25, line=92 (Compiled frame)
>  - 
> org.apache.cassandra.service.AbstractReadExecutor$SpeculatingReadExecutor.maybeTryAdditionalReplicas()
>  @bci=39, line=281 (Compiled frame)
>  - org.apache.cassandra.service.StorageProxy.fetchRows(java.util.List, 
> org.apache.cassandra.db.ConsistencyLevel) @bci=175, line=1338 (Compiled frame)
>  - org.apache.cassandra.service.StorageProxy.readRegular(java.util.List, 
> org.apache.cassandra.db.ConsistencyLevel) @bci=9, line=1274 (Compiled frame)
>  - org.apache.cassandra.service.StorageProxy.read(java.util.List, 
> org.apache.cassandra.db.ConsistencyLevel, 
> org.apache.cassandra.service.ClientState) @bci=57, line=1199 (Compiled frame)
>  - 
> org.apache.cassandra.cql3.statements.SelectStatement.execute(org.apache.cassandra.service.pager.Pageable,
>  org.apache.cassandra.cql3.QueryOptions, int, long, 
> org.apache.cassandra.service.QueryState) @bci=35, line=272 (Compiled frame)
>  - 
> org.apache.cassandra.cql3.statements.SelectStatement.execute(org.apache.cassandra.service.QueryState,
>  org.apache.cassandra.cql3.QueryOptions) @bci=105, line=224 (Compiled frame)
>  - org.apache.cassandra.auth.Auth.selectUser(java.lang.String) @bci=27, 
> line=265 (Compiled frame)
>  - org.apache.cassandra.auth.Auth.isExistingUser(java.lang.String) @bci=1, 
> line=86 (Compiled frame)
>  - 
> org.apache.cassandra.service.ClientState.login(org.apache.cassandra.auth.AuthenticatedUser)
>  @bci=11, line=206 (Compiled frame)
>  - 
> org.apache.cassandra.transport.messages.AuthResponse.execute(org.apache.cassandra.service.QueryState)
>  @bci=58, line=82 (Compiled frame)
>  - 
> org.apache.cassandra.transport.Message$Dispatcher.channelRead0(io.netty.channel.ChannelHandlerContext,
>  org.apache.cassandra.transport.Message$Request) @bci=75, line=439 (Compiled 
> frame)
>  - 
> org.apache.cassandra.transport.Message$Dispatcher.channelRead0(io.netty.channel.ChannelHandlerContext,
>  java.lang.Object) @bci=6, line=335 (Compiled frame)
>  - 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(io.netty.channel.ChannelHandlerContext,
>  java.lang.Object) @bci=17, line=105 (Compiled frame)
>  - 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(java.lang.Object)
>  @bci=9, line=333 (Compiled frame)
>  - 
> io.netty.channel.AbstractChannelHandlerContext.access$700(io.netty.channel.AbstractChannelHandlerContext,
>  java.lang.Object) @bci=2, line=32 (Compiled frame)
>  - io.netty.channel.AbstractChannelHandlerContext$8.run() @bci=8, line=324 
> (Compiled frame)
>  - java.util.concurrent.Executors$RunnableAdapter.call() @bci=4, line=511 
> (Compiled frame)
>  - 
> org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask.run()
>  @bci=5, line=164 (Compiled frame)
>  - org.apache.cassandra.concurrent.SEPWorker.run() @bci=87, line=105 
> (Interpreted frame)
>  - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
> {code}
> In a cluster with many connected clients (potentially thousands), a 
> reconnection flood (for example, restarting all at once) is likely to trigger 
> this bug. However, it is unlikely to be seen in normal operation. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to