[jira] [Comment Edited] (CASSANDRA-18580) Baseline Metrics for Accord Transactions

Jacek Lewandowski (Jira) Tue, 01 Aug 2023 00:44:05 -0700


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-18580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17749519#comment-17749519
 ]


Jacek Lewandowski edited comment on CASSANDRA-18580 at 8/1/23 7:43 AM:
-----------------------------------------------------------------------

This is just an example of how I wanted to capture events in the 
{{ProgressLog}}. {{AccordService#coordinate()}} measures the overall 
transaction time at the coordinator. With {{ProgressLog}}, the same 
transaction will be measured several times, and we will see how each replica 
deals with it separately. This way, we distinguish two categories of metrics: 
Accord coordinator metrics (the current CQL metrics) and Accord replica 
metrics, which I think should be registered under separate namespaces for 
clarity. 

As for the latency metrics we can gather that way: currently, we have 3 
timestamps available:
- {{txnId}} - when the transaction was submitted
- {{executeAt}} - the final timestamp assigned to the transaction
- the current timestamp at commit, pre/post execution, durable...

We can probably measure the time between any two of them.
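
A minimal sketch of measuring the time between any two of these timestamps, assuming each one is already available as a microsecond value. The class and method names are hypothetical and are not part of Accord or Cassandra:

```java
// Hypothetical holder for the three timestamps listed above, all in microseconds.
final class TxnTimings {
    final long txnIdMicros;     // when the transaction was submitted
    final long executeAtMicros; // the executeAt timestamp
    final long eventMicros;     // local clock at the observed event (commit, applied, ...)

    TxnTimings(long txnIdMicros, long executeAtMicros, long eventMicros) {
        this.txnIdMicros = txnIdMicros;
        this.executeAtMicros = executeAtMicros;
        this.eventMicros = eventMicros;
    }

    // Signed differences: a negative value means the later timestamp was
    // taken from a clock that is behind the earlier one.
    long submitToExecuteAtMicros() { return executeAtMicros - txnIdMicros; }
    long executeAtToEventMicros()  { return eventMicros - executeAtMicros; }
    long submitToEventMicros()     { return eventMicros - txnIdMicros; }
}
```

The differences are left signed on purpose, so that the caller decides how to treat values distorted by clocks on different nodes.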

We could deliberately create more metrics than needed now and decide which 
should stay before the release. I assume there will be some performance work 
that will clarify which metrics make the most sense.

Regarding preemptions and recoveries: I was thinking about capturing those 
events in the {{AccordAgent#onRecover}} implementation. Each invocation would 
mark the recoveries meter, and if the failure is {{Preempted}}, it would 
additionally mark the preemptions meter. That could be presented as a ratio 
gauge (preemptions to recoveries).
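
The idea can be sketched as follows. This is an illustration only: the {{onRecover}} signature shown here is an assumption, {{PreemptedFailure}} is a hypothetical stand-in for Accord's {{Preempted}} failure, and plain counters stand in for the meter and ratio gauge types of the metrics library:

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-in for Accord's Preempted failure type.
final class PreemptedFailure extends RuntimeException {}

// Hypothetical metrics holder; real code would register meters and a gauge instead.
final class RecoveryMetrics {
    private final LongAdder recoveries = new LongAdder();
    private final LongAdder preemptions = new LongAdder();

    // Assumed callback shape: invoked once per recovery with the failure that caused it.
    void onRecover(Throwable failure) {
        recoveries.increment();
        if (failure instanceof PreemptedFailure) {
            preemptions.increment();
        }
    }

    long recoveryCount()   { return recoveries.sum(); }
    long preemptionCount() { return preemptions.sum(); }

    // Share of recoveries that were caused by preemption; 0.0 before any recovery.
    double preemptionRatio() {
        long total = recoveries.sum();
        return total == 0 ? 0.0 : (double) preemptions.sum() / total;
    }
}
```

With one preempted and one other failure recorded, {{preemptionRatio()}} would report 0.5.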

My understanding of the timestamps is as follows. The {{TxnId}} is created at 
the coordinator when the CQL statement arrives. The transaction is committed 
on each replica, so if we want to measure start-to-commit time, we must base 
it on timestamps captured on different nodes. As far as I understand, the 
{{Timestamp}} consists of 128 bits plus a node id, which is stored in a 
separate field. The {{hlc}} part is a long initialized with 
{{AccordService.nowInMicros()}} and does not include any flags, epoch, or node 
identifier. 
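
Because the start and commit timestamps come from different nodes, a replica-side measurement has to tolerate clock differences. A rough sketch, assuming the {{hlc}} of the {{TxnId}} can be read as plain microseconds as described above; the recorder class is hypothetical and only accumulates simple totals rather than a real latency histogram:

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical replica-side recorder for start-to-commit latency.
final class StartToCommitRecorder {
    private final LongAdder samples = new LongAdder();
    private final LongAdder totalMicros = new LongAdder();
    private final LongAdder skewedSamples = new LongAdder();

    // txnIdHlcMicros: taken on the coordinator; localNowMicros: taken on this replica.
    void record(long txnIdHlcMicros, long localNowMicros) {
        long elapsed = localNowMicros - txnIdHlcMicros;
        if (elapsed < 0) {
            // The local clock is behind the coordinator's: count it and clamp to zero.
            skewedSamples.increment();
            elapsed = 0;
        }
        samples.increment();
        totalMicros.add(elapsed);
    }

    long count()       { return samples.sum(); }
    long skewedCount() { return skewedSamples.sum(); }

    double meanMicros() {
        long n = samples.sum();
        return n == 0 ? 0.0 : (double) totalMicros.sum() / n;
    }
}
```

Counting the clamped samples separately makes it visible when clock differences, rather than the transactions themselves, dominate the numbers.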

By the way, thanks for looking at my patch.



> Baseline Metrics for Accord Transactions
> ----------------------------------------
>
>                 Key: CASSANDRA-18580
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18580
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Accord, Observability/JMX, Observability/Metrics
>            Reporter: Caleb Rackliffe
>            Assignee: Jacek Lewandowski
>            Priority: Normal
>             Fix For: 5.x
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Based on some conversations w/ [~benedict] and [~dcapwell], this is the 
> initial set of metrics that seem both feasible to implement and useful as we 
> monitor the health of a cluster performing Accord transactions:
> 1.) Basic latency metrics for transactions up to the point of COMMIT and rate 
> metrics for preemption, failure, and timeouts at the coordinator.
> This has already been implemented and split into read and write-specific 
> metrics. Our position for now is that metrics around preemption should be 
> useful in place of a more difficult-to-define metric around how many 
> transactions are completed via recovery.
> 2.) Global cache stats/metrics (i.e. aggregated for all command stores)
> We could, at some point, build metrics scoped to a specific {{CommandStore}}, 
> but they might be awkward in MBean/JMX space, as command stores would have to 
> be identified by ID or key range…the latter possibly being able to change 
> across epochs. (An alternative would be just publishing command 
> store-specific stats on-demand to a virtual table instead.)
> 3.) Something like a decaying histogram of the number of dependencies per 
> transaction (or per partial transaction).
> If this is getting worse over time, that would be useful to know and could 
> help us detect that contention is increasing. We should be able to hook this up 
> to {{ProgressLog}} notifications. Recording for PartialDeps/PartialTxn (which 
> ProgressLog gives us at pre-accept) seems acceptable, given this is a 
> directional metric.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
