[
https://issues.apache.org/jira/browse/CASSANDRA-15066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861370#comment-16861370
]
Benedict commented on CASSANDRA-15066:
--------------------------------------
Thanks for even more review feedback! I've pushed some modifications that you
can take a look at, addressing most of your concerns. Some comments /
questions:
bq. in releaseSpace i’d leave a comment why thread is unparked but waiting is
not set to null (parkUntilFlushed releases it)
Rather than this, I have made absolutely explicit in the class description, and
on each variable, that every variable is updated by only one thread, and which
thread this is. Does this work for you?
bq. in resumeEpochSampling and resumeNowSampling we can use
scheduleWithFixedDelay(task, 0, … ) instead of running a task upfront
I don't think so, because we want to be certain that after the method is
invoked (particularly in the constructor) the relevant value has been updated -
particularly for {{now}} and for the very first translation call
bq. looks like monotonic clock implementation might race. each operation is
synchronized but their combination isn’t:
Good catch. This method should probably have been marked
{{@VisisbleForTesting}}, but you're right that it could have been misused in
future.
bq. In OutboundConnection, there seems to be a race condition between between
promiseToExecuteLater(); requestConnect().addListener(f -> executeAgain()); and
maybeExecuteAgain();.
Really excellent catch, but I think this is much simpler than you think - it's
a (pretty serious) typo / mistake in {{executeAgain}}, which has its condition
inverted. I'm a bit distracted today, but I'm pretty sure this fixes it (and I
think it's caused because the condition is inverted from the outer condition,
so I've inverted the inner condition to match, instead of the order of ternary
operands)
bq. Short question: How is the approximate clock implementation with sampling
better than just using a regular clock?
Just to absolutely clarify, this feature existed already, I have just packaged
it to permit disabling its use as others have also raised questions about its
purpose. The only point is to reduce the cost of the very frequent clock
accesses we perform. Ideally we would probably replace it with direct use of
RDTSC. Either way, this simply means that this patch's usage of clocks is
cleanly separated into those we require to be precise, and those we permit to
be approximate, and lets us override the implementation. We can use
{{System.nanoTime}} in all cases, and in fact see if there is even any
measurable impact.
bq. Is the intention to normalize clock calls to yield the epoch timestamp or
the intention is to improve performance by doing so periodically?
bq. Should error in approximate time be an absolute value? Especially since we
seem to compare two error values later. However, it seems it has to be the case
anyways
Could you expand a little on these questions?
bq. Also, what are we going to do with all the TODOs?.. Should we create
follow-up tickets for them?
I've been periodically auditing them, and yes I will file follow-up Jira for
any that seem to warrant it (in general I think it's OK to leave some TODOs in
place without Jira, for the next maintainer)
Thanks again for this and all of your other feedback over the past few months!
> Improvements to Internode Messaging
> -----------------------------------
>
> Key: CASSANDRA-15066
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15066
> Project: Cassandra
> Issue Type: Improvement
> Components: Messaging/Internode
> Reporter: Benedict
> Assignee: Benedict
> Priority: High
> Fix For: 4.0
>
> Attachments: 20k_backfill.png, 60k_RPS.png,
> 60k_RPS_CPU_bottleneck.png, backfill_cass_perf_ft_msg_tst.svg,
> baseline_patch_vs_30x.png, increasing_reads_latency.png,
> many_reads_cass_perf_ft_msg_tst.svg
>
>
> CASSANDRA-8457 introduced asynchronous networking to internode messaging, but
> there have been several follow-up endeavours to improve some semantic issues.
> CASSANDRA-14503 and CASSANDRA-13630 are the latest such efforts, and were
> combined some months ago into a single overarching refactor of the original
> work, to address some of the issues that have been discovered. Given the
> criticality of this work to the project, we wanted to bring some more eyes to
> bear to ensure the release goes ahead smoothly. In doing so, we uncovered a
> number of issues with messaging, some of which long standing, that we felt
> needed to be addressed. This patch widens the scope of CASSANDRA-14503 and
> CASSANDRA-13630 in an effort to close the book on the messaging service, at
> least for the foreseeable future.
> The patch includes a number of clarifying refactors that touch outside of the
> {{net.async}} package, and a number of semantic changes to the {{net.async}}
> packages itself. We believe it clarifies the intent and behaviour of the
> code while improving system stability, which we will outline in comments
> below.
> https://github.com/belliottsmith/cassandra/tree/messaging-improvements
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]