[
https://issues.apache.org/jira/browse/CASSANDRA-6476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13846256#comment-13846256
]
Sylvain Lebresne commented on CASSANDRA-6476:
---------------------------------------------
MessagingService ain't the native transport (fyi, the native transport code
doesn't leak outside the org.apache.cassandra.transport package), it's the
intra-cluster messaging. In fact the stack trace shows that the write that
triggers it doesn't even come from the native protocol but from thrift (which
means you either use thrift for some things or something is whack).
But truth is, given the stack trace, where the writes come from doesn't
matter. The assertion that fails is the line
{noformat}
assert previous == null;
{noformat}
in MessagingService.addCallback. And that's where things stop making sense to
me. This means that we tried to add a new message to the callback map but there
was already one with the same messageId. Except that messageId is very
straightforwardly generated by an {{incrementAndGet}} on a static
AtomicInteger. And as far as I can tell, no other code inserts in the callback
map without grabbing a new messageId this way (except setCallbackForTests, but
it is only used in a unit test).
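To make the failure mode concrete, here is a minimal sketch of the pattern described above; the class and method names are illustrative, not the actual Cassandra code:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified model: ids are minted only by incrementAndGet on one static
// AtomicInteger, and registering a callback asserts the id was unused.
class CallbackRegistry
{
    private static final AtomicInteger idGen = new AtomicInteger(0);
    final ConcurrentHashMap<Integer, Runnable> callbacks = new ConcurrentHashMap<>();

    static int nextId()
    {
        return idGen.incrementAndGet(); // the only path that produces messageIds
    }

    void addCallback(int messageId, Runnable cb)
    {
        Runnable previous = callbacks.put(messageId, cb);
        assert previous == null; // the assertion that fires in this ticket
    }
}
```

Under this model the assertion can only trip if the same id value is handed out twice while the first entry is still in the map.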
Therefore, it seems the only way such messageId conflict could happen is that
we've gone full cycle on the AtomicInteger and hit the same id again. But
entries in callbacks expire after the rpc timeout, so that implies > 4 billion
requests in about 10 seconds. Sounds pretty unlikely to me.
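A quick back-of-the-envelope check of that theory (the 10-second figure is the rpc timeout ballpark from above): {{incrementAndGet}} wraps silently at Integer.MAX_VALUE, but reusing an id within the timeout window would require cycling through all 2^32 values first.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class WrapMath
{
    public static void main(String[] args)
    {
        // incrementAndGet wraps to Integer.MIN_VALUE, no exception thrown
        AtomicInteger id = new AtomicInteger(Integer.MAX_VALUE);
        System.out.println(id.incrementAndGet()); // -2147483648

        long fullCycle = 1L << 32;       // 4,294,967,296 distinct int values
        long timeoutSeconds = 10;        // rpc timeout ballpark
        // sustained request rate needed to reuse an id within the timeout
        System.out.println(fullCycle / timeoutSeconds); // ~429 million/sec
    }
}
```

Roughly 429 million writes per second, sustained, on one coordinator; hence "pretty unlikely".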
But I might be missing something obvious: [~jbellis], I believe you might be
more familiar with MessagingService, any idea?
> Assertion error in MessagingService.addCallback
> -----------------------------------------------
>
> Key: CASSANDRA-6476
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6476
> Project: Cassandra
> Issue Type: Bug
> Environment: Cassandra 2.0.2 DCE
> Reporter: Theo Hultberg
> Assignee: Sylvain Lebresne
>
> Two of the three Cassandra nodes in one of our clusters just started behaving
> very strangely about an hour ago. Within a minute of each other they started
> logging AssertionErrors (see stack traces here:
> https://gist.github.com/iconara/7917438) over and over again. The client lost
> connection with the nodes at roughly the same time. The nodes were still up,
> and even if no clients were connected to them they continued logging the same
> errors over and over.
> The errors are in the native transport (specifically
> MessagingService.addCallback) which makes me suspect that it has something to
> do with a test that we started running this afternoon. I've just implemented
> support for frame compression in my CQL driver cql-rb. About two hours before
> this happened I deployed a version of the application which enabled Snappy
> compression on all frames larger than 64 bytes. It's not impossible that
> there is a bug somewhere in the driver or compression library that caused
> this -- but at the same time, it feels like it shouldn't be possible to make
> C* a zombie with a bad frame.
> Restarting seems to have gotten them back up and running again, but I suspect
> they will go down again sooner or later.