[
https://issues.apache.org/jira/browse/CASSANDRA-6476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13846256#comment-13846256
]
Sylvain Lebresne commented on CASSANDRA-6476:
---------------------------------------------
MessagingService ain't the native transport (fyi, the native transport code
doesn't leak outside the org.apache.cassandra.transport package), it's the
intra-cluster messaging. In fact the stack trace shows that the write that
triggers it doesn't even come from the native protocol but from thrift (which
means you either use thrift for some things or something is whack).
But truth is, given the stack trace, where the writes come from doesn't
matter. The assertion that fails is the line
{noformat}
assert previous == null;
{noformat}
in MessagingService.addCallback. And that's where things stop making sense to
me. This means that we tried to add a new message to the callback map but there
was already one with the same messageId. Except that messageId is very
straightforwardly generated by an {{incrementAndGet}} on a static
AtomicInteger. And as far as I can tell, no other code inserts in the callback
map without grabbing a new messageId this way (except setCallbackForTests, but
it is only used in a unit test).
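To make the failure mode concrete, here is a minimal sketch of the pattern described above; the class and method names are illustrative, not the actual Cassandra code:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified model: ids are minted only by incrementAndGet on one static
// AtomicInteger, and registering a callback asserts the id was unused.
class CallbackRegistry
{
    private static final AtomicInteger idGen = new AtomicInteger(0);
    final ConcurrentHashMap<Integer, Runnable> callbacks = new ConcurrentHashMap<>();

    static int nextId()
    {
        return idGen.incrementAndGet(); // the only path that produces messageIds
    }

    void addCallback(int messageId, Runnable cb)
    {
        Runnable previous = callbacks.put(messageId, cb);
        assert previous == null; // the assertion that fires in this ticket
    }
}
```

Under this model the assertion can only trip if the same id value is handed out twice while the first entry is still in the map.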
Therefore, it seems the only way such messageId conflict could happen is that
we've gone full cycle on the AtomicInteger and hit the same id again. But
entries in callbacks expire after the rpc timeout, so that implies > 4 billion
requests in about 10 seconds. Sounds pretty unlikely to me.
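A quick back-of-the-envelope check of that theory (the 10-second figure is the rpc timeout ballpark from above): {{incrementAndGet}} wraps silently at Integer.MAX_VALUE, but reusing an id within the timeout window would require cycling through all 2^32 values first.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class WrapMath
{
    public static void main(String[] args)
    {
        // incrementAndGet wraps to Integer.MIN_VALUE, no exception thrown
        AtomicInteger id = new AtomicInteger(Integer.MAX_VALUE);
        System.out.println(id.incrementAndGet()); // -2147483648

        long fullCycle = 1L << 32;       // 4,294,967,296 distinct int values
        long timeoutSeconds = 10;        // rpc timeout ballpark
        // sustained request rate needed to reuse an id within the timeout
        System.out.println(fullCycle / timeoutSeconds); // ~429 million/sec
    }
}
```

Roughly 429 million writes per second, sustained, on one coordinator; hence "pretty unlikely".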
But I might be missing something obvious: [~jbellis], I believe you might be
more familiar with MessagingService, any idea?
> Assertion error in MessagingService.addCallback
> -----------------------------------------------
>
> Key: CASSANDRA-6476
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6476
> Project: Cassandra
> Issue Type: Bug
> Environment: Cassandra 2.0.2 DCE
> Reporter: Theo Hultberg
> Assignee: Sylvain Lebresne
>
> Two of the three Cassandra nodes in one of our clusters just started behaving
> very strangely about an hour ago. Within a minute of each other they started
> logging AssertionErrors (see stack traces here:
> https://gist.github.com/iconara/7917438) over and over again. The client lost
> connection with the nodes at roughly the same time. The nodes were still up,
> and even if no clients were connected to them they continued logging the same
> errors over and over.
> The errors are in the native transport (specifically
> MessagingService.addCallback) which makes me suspect that it has something to
> do with a test that we started running this afternoon. I've just implemented
> support for frame compression in my CQL driver cql-rb. About two hours before
> this happened I deployed a version of the application which enabled Snappy
> compression on all frames larger than 64 bytes. It's not impossible that
> there is a bug somewhere in the driver or compression library that caused
> this -- but at the same time, it feels like it shouldn't be possible to make
> C* a zombie with a bad frame.
> Restarting seems to have gotten them back up and running again, but I suspect
> they will go down again sooner or later.