[ https://issues.apache.org/jira/browse/CASSANDRA-11340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236042#comment-15236042 ]
Russ Hatch edited comment on CASSANDRA-11340 at 4/12/16 7:12 PM:
-----------------------------------------------------------------
Tried another run today with some long-running connections and still haven't
had luck getting a repro. There's got to be something more nuanced going on
with the perf problem.
> Heavy read activity on system_auth tables can cause apparent livelock
> ---------------------------------------------------------------------
>
> Key: CASSANDRA-11340
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11340
> Project: Cassandra
> Issue Type: Bug
> Reporter: Jeff Jirsa
> Assignee: Aleksey Yeschenko
> Attachments: mass_connect.py, prepare_mass_connect.py
>
>
> Reproduced in at least 2.1.9.
> Queries against system_auth tables appear able to trigger speculative
> retry, which makes auth block on traffic going off node. In some cases
> threads appear to become deadlocked, and load on the nodes increases
> sharply. This happens even in clusters where the RF of system_auth == N:
> with every request served locally, observed latencies are low, so the bar
> for 99th-percentile speculative retry (99% SR) is low as well.
> Incomplete stack trace below, but we haven't yet figured out what exactly is
> blocking:
> {code}
> Thread 82291: (state = BLOCKED)
> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
> - java.util.concurrent.locks.LockSupport.parkNanos(long) @bci=11, line=338 (Compiled frame)
> - org.apache.cassandra.utils.concurrent.WaitQueue$AbstractSignal.awaitUntil(long) @bci=28, line=307 (Compiled frame)
> - org.apache.cassandra.utils.concurrent.SimpleCondition.await(long, java.util.concurrent.TimeUnit) @bci=76, line=63 (Compiled frame)
> - org.apache.cassandra.service.ReadCallback.await(long, java.util.concurrent.TimeUnit) @bci=25, line=92 (Compiled frame)
> - org.apache.cassandra.service.AbstractReadExecutor$SpeculatingReadExecutor.maybeTryAdditionalReplicas() @bci=39, line=281 (Compiled frame)
> - org.apache.cassandra.service.StorageProxy.fetchRows(java.util.List, org.apache.cassandra.db.ConsistencyLevel) @bci=175, line=1338 (Compiled frame)
> - org.apache.cassandra.service.StorageProxy.readRegular(java.util.List, org.apache.cassandra.db.ConsistencyLevel) @bci=9, line=1274 (Compiled frame)
> - org.apache.cassandra.service.StorageProxy.read(java.util.List, org.apache.cassandra.db.ConsistencyLevel, org.apache.cassandra.service.ClientState) @bci=57, line=1199 (Compiled frame)
> - org.apache.cassandra.cql3.statements.SelectStatement.execute(org.apache.cassandra.service.pager.Pageable, org.apache.cassandra.cql3.QueryOptions, int, long, org.apache.cassandra.service.QueryState) @bci=35, line=272 (Compiled frame)
> - org.apache.cassandra.cql3.statements.SelectStatement.execute(org.apache.cassandra.service.QueryState, org.apache.cassandra.cql3.QueryOptions) @bci=105, line=224 (Compiled frame)
> - org.apache.cassandra.auth.Auth.selectUser(java.lang.String) @bci=27, line=265 (Compiled frame)
> - org.apache.cassandra.auth.Auth.isExistingUser(java.lang.String) @bci=1, line=86 (Compiled frame)
> - org.apache.cassandra.service.ClientState.login(org.apache.cassandra.auth.AuthenticatedUser) @bci=11, line=206 (Compiled frame)
> - org.apache.cassandra.transport.messages.AuthResponse.execute(org.apache.cassandra.service.QueryState) @bci=58, line=82 (Compiled frame)
> - org.apache.cassandra.transport.Message$Dispatcher.channelRead0(io.netty.channel.ChannelHandlerContext, org.apache.cassandra.transport.Message$Request) @bci=75, line=439 (Compiled frame)
> - org.apache.cassandra.transport.Message$Dispatcher.channelRead0(io.netty.channel.ChannelHandlerContext, java.lang.Object) @bci=6, line=335 (Compiled frame)
> - io.netty.channel.SimpleChannelInboundHandler.channelRead(io.netty.channel.ChannelHandlerContext, java.lang.Object) @bci=17, line=105 (Compiled frame)
> - io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(java.lang.Object) @bci=9, line=333 (Compiled frame)
> - io.netty.channel.AbstractChannelHandlerContext.access$700(io.netty.channel.AbstractChannelHandlerContext, java.lang.Object) @bci=2, line=32 (Compiled frame)
> - io.netty.channel.AbstractChannelHandlerContext$8.run() @bci=8, line=324 (Compiled frame)
> - java.util.concurrent.Executors$RunnableAdapter.call() @bci=4, line=511 (Compiled frame)
> - org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask.run() @bci=5, line=164 (Compiled frame)
> - org.apache.cassandra.concurrent.SEPWorker.run() @bci=87, line=105 (Interpreted frame)
> - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
> {code}
> In a cluster with many connected clients (potentially thousands), a
> reconnection flood (for example, restarting all at once) is likely to trigger
> this bug. However, it is unlikely to be seen in normal operation.
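
To make the "low bar for 99% SR" point concrete, here is a minimal sketch in plain Python. It is a toy model, not Cassandra's implementation: the helper names (`percentile`, `speculation_rate`), the latency ranges, and the 5x slowdown during a reconnection flood are all assumptions chosen for illustration. It takes a 99th-percentile threshold from a history of fast local reads and counts how many reads in a slower burst would exceed it.

```python
import math
import random


def percentile(samples, pct):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def speculation_rate(history_ms, burst_ms, pct=99):
    """Return (threshold, fraction of burst reads slower than the threshold).

    The threshold is the pct-th percentile of past latencies; any read in
    the burst that takes longer is counted as one that would speculate.
    """
    threshold = percentile(history_ms, pct)
    retries = sum(1 for latency in burst_ms if latency > threshold)
    return threshold, retries / len(burst_ms)


rng = random.Random(11340)  # fixed seed so the run is repeatable

# Assumption: local-only auth reads normally take 0.2 to 1.0 ms.
quiet = [rng.uniform(0.2, 1.0) for _ in range(10_000)]

# Assumption: a login flood makes the same reads about 5x slower.
flood = [rng.uniform(0.2, 1.0) * 5 for _ in range(10_000)]

threshold_ms, flood_rate = speculation_rate(quiet, flood)
_, quiet_rate = speculation_rate(quiet, quiet)

print(f"p99 threshold from quiet period: {threshold_ms:.3f} ms")
print(f"reads over threshold when quiet: {quiet_rate:.2%}")
print(f"reads over threshold in flood:   {flood_rate:.2%}")
```

Under these assumptions the threshold sits below 1 ms, so even a modest slowdown pushes nearly every auth read past it, and each of those reads would then wait on a replica that is off node.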