[ 
https://issues.apache.org/jira/browse/CASSANDRA-15700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125211#comment-17125211
 ] 

Sergio Bossa commented on CASSANDRA-15700:
------------------------------------------

[~aleksey] thanks for reviewing: getting back to this after two months and 
having to rebuild the whole thing in my mind made me realize how complex this 
is, and that I should have added a couple more comments, so apologies, and 
kudos for getting through it all alone :)

That said, to answer your question directly: the race you point out is 
absolutely correct, and I unfortunately overlooked it, but there should be an 
easy fix. Before getting to that, I should first clarify why I introduced the 
new {{nextExpirationDeadline}} variable alongside {{earliestExpiryTime}}. 
Simply put, if we only track time via {{earliestExpiryTime}} during both 
{{add}} and {{prune}}, we risk a race where we accumulate an unbounded number 
of messages until the next expiration, as shown in the following scenario:
1. {{add}} is called N times, with {{earliestExpiryTime}} set to the minimum 
expiry time among the added messages; you can't adjust it by the current time 
in this case, otherwise you would never expire anything (as the time would 
always shift forward).
2. {{prune}} is called and {{earliestExpiryTime}} is set to the minimum between 
the minimum time among the pruned messages and the current expiry time.
3. This means that any message arriving between the start and end of the prune, 
whose expiry time was not the minimum but was still less than the minimum 
expiry time among the pruned messages, would be "ignored" and remain in the 
queue (see the sketch below).
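
To make the scenario concrete, here's a minimal sketch of the single-variable 
scheme; the names are hypothetical and the value written by {{prune}} is 
simplified (this is not the actual {{OutboundMessageQueue}} code), but it shows 
where step 3 loses a message:
{noformat}
// Heavily simplified sketch (hypothetical names, not the actual OutboundMessageQueue):
// both add() and prune() update a single earliestExpiryTime.
import java.util.concurrent.atomic.AtomicLong;

class SingleVariableExpirySketch
{
    private final AtomicLong earliestExpiryTime = new AtomicLong(Long.MAX_VALUE);

    // Step 1: keep the minimum expiry time among added messages. This can't be
    // clamped to the current time, or the deadline would keep shifting forward
    // and nothing would ever be considered expired.
    void add(long messageExpiresAt)
    {
        earliestExpiryTime.accumulateAndGet(messageExpiresAt, Math::min);
    }

    // Step 2 (simplified): prune() recomputes the deadline from the messages it
    // observed and overwrites the variable. Step 3: a message added while the
    // prune was in flight, whose add() update is overwritten here, is no longer
    // reflected in the deadline and can linger until some later expiration fires.
    void prune(long minExpiryAmongPruned)
    {
        earliestExpiryTime.set(minExpiryAmongPruned);
    }
}
{noformat}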

At this point you might say this really sounds like the race you discovered, 
and in a way it is :), although in the above case the race window would be much 
larger (the whole {{prune}}).

Anyway, as I said, there should be an easy fix: the deadline should be updated 
only if it's actually the minimum value (adjusted by the current time), and 
I've pushed a new commit with that fix. I don't see a way to fix this kind of 
race while keeping a single variable, but let me know if you find one.
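
For clarity, the fix boils down to a conditional "lower the deadline only if 
the candidate is still the minimum" update, roughly along these lines (a 
hypothetical sketch, not the committed patch):
{noformat}
// Minimal sketch of the fix idea (hypothetical names, not the committed patch):
// the deadline is only ever lowered, never raised, so a racing prune() cannot
// overwrite a smaller deadline that a concurrent add() has just published.
import java.util.concurrent.atomic.AtomicLong;

class DeadlineUpdateSketch
{
    private final AtomicLong nextExpirationDeadline = new AtomicLong(Long.MAX_VALUE);

    void proposeDeadline(long candidate, long now)
    {
        long adjusted = Math.max(candidate, now); // "adjusted by the current time"
        long current;
        while (adjusted < (current = nextExpirationDeadline.get()))
        {
            // Install the candidate only while it's still the minimum; if a
            // concurrent update published an even earlier deadline, stop.
            if (nextExpirationDeadline.compareAndSet(current, adjusted))
                break;
        }
    }
}
{noformat}
The key point of the sketch is that an update can only lower the deadline, so a 
racing writer can't move it past a smaller value another thread just published.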

> Performance regression on internode messaging
> ---------------------------------------------
>
>                 Key: CASSANDRA-15700
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15700
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Messaging/Internode
>            Reporter: Sergio Bossa
>            Assignee: Sergio Bossa
>            Priority: Normal
>              Labels: pull-request-available
>             Fix For: 4.0-beta
>
>         Attachments: Oss40patchedvsOss311.png, Oss40vsOss311.png, oss40.gc, 
> oss40_nogc.tar.xz, oss40_system.log
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> [~jasonstack] and I have been investigating a performance regression 
> affecting 4.0 during a 3-node, RF 3 write throughput test with a 
> timeseries-like workload, as shown in this plot, where blue is 3.11 and 
> orange is 4.0:
> !Oss40vsOss311.png|width=389,height=214!
>  It's been a bit of a long investigation, but two clues ended up standing out:
> 1) An abnormal number of expired messages on 4.0 (as shown in the attached  
> system log), while 3.11 has almost none.
> 2) Abnormal GC activity (as shown in the attached gc log).
> Turns out the two are related, as the [on expired 
> callback|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundConnection.java#L462]
>  creates a huge number of strings in the {{id()}} call. The next question is 
> what causes all those message expirations; we thoroughly reviewed the 
> internode messaging code and the only issue we could find so far is related 
> to the "batch pruning" calls 
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L81]
>  and 
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L188]:
>  it _seems_ too much time is spent on those, causing the event loop to fall 
> behind in processing the rest of the messages, which will end up being 
> expired. This is supported by the analysis of the collapsed stacks (after 
> fixing the GC issue):
> {noformat}
> (tprint (top-aggregated-calls oss40nogc "EventLoopDelivery:doRun" 5))
> org/apache/cassandra/net/OutboundConnection$EventLoopDelivery:doRun 3456
> org/apache/cassandra/net/OutboundMessageQueue:access$600 1621
> org/apache/cassandra/net/PrunableArrayQueue:prune 1621
> org/apache/cassandra/net/OutboundMessageQueue$WithLock:close 1621
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1620
> {noformat}
> Those are the top 5 sampled calls from {{EventLoopDelivery#doRun()}}, which 
> spends half of its time pruning. But only a tiny portion of that pruning time 
> is spent actually expiring:
> {noformat}
> (tprint (top-aggregated-calls oss40nogc 
> "OutboundMessageQueue:pruneInternalQueueWithLock" 5))
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1900
> org/apache/cassandra/net/PrunableArrayQueue:prune 1894
> org/apache/cassandra/net/OutboundMessageQueue$1Pruner:onPruned 147
> org/apache/cassandra/net/OutboundConnection$$Lambda$444/740904487:accept 147
> org/apache/cassandra/net/OutboundConnection:onExpired 147
> {noformat}
> And indeed, the {{PrunableArrayQueue:prune()}} self time is dominant:
> {noformat}
> (tprint (top-self-calls oss40nogc "PrunableArrayQueue:prune" 5))
> org/apache/cassandra/net/PrunableArrayQueue:prune 1718
> org/apache/cassandra/net/OutboundConnection:releaseCapacity 27
> java/util/concurrent/ConcurrentHashMap:replaceNode 19
> java/util/concurrent/ConcurrentLinkedQueue:offer 16
> java/util/concurrent/LinkedBlockingQueue:offer 15
> {noformat}
> That said, before proceeding with a PR to fix those issues, I'd like to 
> understand: what's the reason to prune so often, rather than just when 
> polling messages during delivery? If there's a reason I'm missing, let's 
> talk about how to optimize pruning; otherwise, let's get rid of it.
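>
> To illustrate why per-run pruning is costly, here's a rough, purely 
> illustrative sketch (hypothetical code, not the actual {{PrunableArrayQueue}}) 
> of a prune that has to walk the whole queue on every call:
> {noformat}
> // Rough sketch for illustration only (hypothetical; not the actual PrunableArrayQueue):
> // a prune that walks the whole queue is O(n) per call, so invoking it on every
> // delivery run multiplies that cost, whereas expiring only while polling would
> // amortize it over the messages actually delivered.
> import java.util.ArrayDeque;
> import java.util.Iterator;
> import java.util.function.LongConsumer;
>
> final class PruneCostSketch
> {
>     private final ArrayDeque<Long> expirationTimes = new ArrayDeque<>();
>
>     void add(long expiresAtNanos)
>     {
>         expirationTimes.add(expiresAtNanos);
>     }
>
>     // Every call touches every element still in the queue.
>     int prune(long nowNanos, LongConsumer onExpired)
>     {
>         int pruned = 0;
>         for (Iterator<Long> it = expirationTimes.iterator(); it.hasNext();)
>         {
>             long expiresAt = it.next();
>             if (expiresAt <= nowNanos)
>             {
>                 it.remove();
>                 onExpired.accept(expiresAt);
>                 pruned++;
>             }
>         }
>         return pruned;
>     }
> }
> {noformat}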


