[
https://issues.apache.org/jira/browse/CASSANDRA-15700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17124959#comment-17124959
]
Aleksey Yeschenko commented on CASSANDRA-15700:
-----------------------------------------------
I *think* the way we update times at the end of pruning in this patch is
potentially problematic, but I might be getting this wrong, so pardon me for
any obvious foolishness. I was thinking about this particular scenario:
1. After partial consumption/delivery, thread {{t1}} updates
{{earliestExpiryTime}} with some high value (say, all messages so far have had
long timeouts); it then gets descheduled for a period of time.
2. A different thread, {{t2}}, invokes {{add()}} with lots of short-expiration
messages, all of which skip pruning because the lock is held by {{t1}}; it
repeatedly updates both {{earliestExpiryTime}} and {{nextExpirationDeadline}}
with a small value.
3. Thread {{t1}} is now scheduled again and sets {{nextExpirationDeadline}} to
the high value from step (1).
If no further {{add()}} calls come in and delivery has stalled, we can now hold
an arbitrary number of the messages added at step 2, since we use the
{{nextExpirationDeadline}} value to determine whether or not pruning is
necessary.
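To make the interleaving concrete, here is a minimal standalone sketch (simplified stand-in names, not the actual {{OutboundMessageQueue}} code) of how an unconditional write at the end of pruning can clobber a smaller deadline published concurrently by {{add()}}, and how a monotonic min-update would avoid it:
{code:java}
import java.util.concurrent.atomic.AtomicLong;

public class DeadlineRaceSketch
{
    // Hypothetical stand-in for the queue's next expiration deadline
    final AtomicLong nextExpirationDeadline = new AtomicLong(Long.MAX_VALUE);

    // Racy: blindly publishes whatever deadline the pruning thread computed,
    // even if a concurrent add() has already published a smaller one
    void finishPruneRacy(long deadlineSeenWhilePruning)
    {
        nextExpirationDeadline.set(deadlineSeenWhilePruning);
    }

    // Safe: only ever lowers the deadline, so a short-expiration message added
    // concurrently can never be "un-published" by a stale, higher value
    void finishPruneMonotonic(long deadlineSeenWhilePruning)
    {
        nextExpirationDeadline.accumulateAndGet(deadlineSeenWhilePruning, Math::min);
    }

    // Deterministic replay of the interleaving above: t1 computes a high
    // deadline, t2 publishes a low one while t1 is "descheduled", then t1
    // wakes up and overwrites it
    public static void main(String[] args) throws InterruptedException
    {
        DeadlineRaceSketch q = new DeadlineRaceSketch();
        long highDeadline = 10_000; // t1's stale view (long-timeout messages)
        long lowDeadline = 100;     // published by t2's add() of short-expiration messages

        Thread t2 = new Thread(() -> q.nextExpirationDeadline.accumulateAndGet(lowDeadline, Math::min));
        t2.start();
        t2.join();

        q.finishPruneRacy(highDeadline);
        System.out.println(q.nextExpirationDeadline.get()); // 10000 -- the low deadline is lost
    }
}
{code}
With {{finishPruneMonotonic()}} instead, the final value would stay at 100 and the short-expiration messages would remain eligible for pruning.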
> Performance regression on internode messaging
> ---------------------------------------------
>
> Key: CASSANDRA-15700
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15700
> Project: Cassandra
> Issue Type: Bug
> Components: Messaging/Internode
> Reporter: Sergio Bossa
> Assignee: Sergio Bossa
> Priority: Normal
> Labels: pull-request-available
> Fix For: 4.0-beta
>
> Attachments: Oss40patchedvsOss311.png, Oss40vsOss311.png, oss40.gc,
> oss40_nogc.tar.xz, oss40_system.log
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> [~jasonstack] and I have been investigating a performance regression
> affecting 4.0 during a 3-node, RF 3 write throughput test with a
> timeseries-like workload, as shown in this plot, where blue is 3.11 and
> orange is 4.0:
> !Oss40vsOss311.png|width=389,height=214!
> It's been a bit of a long investigation, but two clues ended up standing out:
> 1) An abnormal number of expired messages on 4.0 (as shown in the attached
> system log), while 3.11 has almost none.
> 2) Abnormal GC activity (as shown in the attached gc log).
> Turns out the two are related, as the [on expired
> callback|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundConnection.java#L462]
> creates a huge number of strings in the {{id()}} call. The next question is
> what causes all those message expirations; we thoroughly reviewed the
> internode messaging code and the only issue we could find so far is related
> to the "batch pruning" calls
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L81]
> and
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L188]:
> it _seems_ too much time is spent on those, causing the event loop to fall
> behind in processing the rest of the messages, which then end up expiring.
> This is supported by the analysis of the collapsed stacks (after
> fixing the GC issue):
> {noformat}
> (tprint (top-aggregated-calls oss40nogc "EventLoopDelivery:doRun" 5))
> org/apache/cassandra/net/OutboundConnection$EventLoopDelivery:doRun 3456
> org/apache/cassandra/net/OutboundMessageQueue:access$600 1621
> org/apache/cassandra/net/PrunableArrayQueue:prune 1621
> org/apache/cassandra/net/OutboundMessageQueue$WithLock:close 1621
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1620
> {noformat}
> Those are the top 5 sampled calls from {{EventLoopDelivery#doRun()}}, which
> spends half of its time pruning. But only a tiny portion of that pruning time
> is actually spent expiring messages:
> {noformat}
> (tprint (top-aggregated-calls oss40nogc
> "OutboundMessageQueue:pruneInternalQueueWithLock" 5))
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1900
> org/apache/cassandra/net/PrunableArrayQueue:prune 1894
> org/apache/cassandra/net/OutboundMessageQueue$1Pruner:onPruned 147
> org/apache/cassandra/net/OutboundConnection$$Lambda$444/740904487:accept 147
> org/apache/cassandra/net/OutboundConnection:onExpired 147
> {noformat}
> And indeed, the {{PrunableArrayQueue:prune()}} self time is dominant:
> {noformat}
> (tprint (top-self-calls oss40nogc "PrunableArrayQueue:prune" 5))
> org/apache/cassandra/net/PrunableArrayQueue:prune 1718
> org/apache/cassandra/net/OutboundConnection:releaseCapacity 27
> java/util/concurrent/ConcurrentHashMap:replaceNode 19
> java/util/concurrent/ConcurrentLinkedQueue:offer 16
> java/util/concurrent/LinkedBlockingQueue:offer 15
> {noformat}
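> As for the first clue above (the allocation pressure from building message id
> strings in the expiry callback), here is a minimal illustrative sketch of the
> usual mitigation: defer string construction until something actually consumes
> it. Names here are hypothetical, not the actual Cassandra callback:
> {code:java}
> import java.util.function.Supplier;
> 
> public class ExpiryCallbackSketch
> {
>     // Hypothetical message type, not org.apache.cassandra.net.Message
>     static final class Msg
>     {
>         final long id;
>         Msg(long id) { this.id = id; }
>     }
> 
>     // Eager: builds the id string for every expired message, whether or not
>     // anything consumes it -- pure allocation/GC pressure on a hot path
>     static void onExpiredEager(Msg message)
>     {
>         maybeLog("messageId=" + message.id);
>     }
> 
>     // Lazy: hand over a Supplier and only materialise the string if the
>     // consumer (rate-limited logging, detailed metrics, ...) asks for it
>     static void onExpiredLazy(Msg message)
>     {
>         maybeLogLazy(() -> "messageId=" + message.id);
>     }
> 
>     static void maybeLog(String id) { /* e.g. rate-limited warn */ }
>     static void maybeLogLazy(Supplier<String> id) { /* calls id.get() only when it will log */ }
> }
> {code}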
> That said, before proceeding with a PR to fix those issues, I'd like to
> understand: what's the reason to prune so often, rather than just when polling
> messages during delivery? If there's a reason I'm missing, let's talk about
> how to optimize pruning; otherwise, let's get rid of it.
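> For reference, a minimal sketch of that "prune only when polling" alternative
> (hypothetical types, not the actual {{OutboundMessageQueue}} /
> {{PrunableArrayQueue}} API): expiration happens lazily, as a side effect of
> the delivery loop polling the queue.
> {code:java}
> import java.util.ArrayDeque;
> import java.util.Deque;
> import java.util.function.Consumer;
> 
> public class LazyExpiryQueueSketch
> {
>     // Hypothetical view of a queued message: just an expiry deadline in nanos
>     public interface Expirable { long expiresAtNanos(); }
> 
>     private final Deque<Expirable> queue = new ArrayDeque<>();
>     private final Consumer<Expirable> onExpired;
> 
>     public LazyExpiryQueueSketch(Consumer<Expirable> onExpired) { this.onExpired = onExpired; }
> 
>     public void add(Expirable message) { queue.addLast(message); }
> 
>     // No separate prune pass: expired heads are dropped (and reported via the
>     // expiry callback) only when the delivery loop asks for the next message
>     public Expirable poll(long nowNanos)
>     {
>         Expirable head;
>         while ((head = queue.pollFirst()) != null)
>         {
>             if (head.expiresAtNanos() > nowNanos)
>                 return head;        // still live: deliver it
>             onExpired.accept(head); // expired: report it and keep looking
>         }
>         return null;                // nothing deliverable
>     }
> }
> {code}
> The obvious trade-off is that expired messages linger in the queue until the
> delivery loop polls past them, which may well be the reason eager pruning
> exists; that is essentially the question above.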