[
https://issues.apache.org/jira/browse/CASSANDRA-10580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15055007#comment-15055007
]
Paulo Motta commented on CASSANDRA-10580:
-----------------------------------------
Looking good. A few comments:
* I think you're fine using {{System.currentTimeMillis()}} instead of
{{ApproximateTime}} to avoid imprecisions.
* Rename {{TimeTaken}} to {{DroppedLatency}} to be consistent with other
similar metric names (https://wiki.apache.org/cassandra/Metrics).
** Actually, I think it's better to have 2 metrics {{InternalDroppedLatency}}
and {{CrossNodeDroppedLatency}}, since they will be quite different (see
CASSANDRA-9793 for more information).
* Add tests to check that the metrics are correct, probably in
{{MessagingServiceTests}}.
* You'll probably also want to verify that the metrics work by bringing up a
cluster manually or with ccm, stressing it with cassandra-stress, and checking
via JMX with VisualVM that the new metrics are being recorded correctly.
* The latest patch fails to apply with {{fatal: corrupt patch at line 125}}. I
don't know exactly what causes that. I wonder if it's a cross-platform thing.
Are you able to apply it locally?
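As a rough illustration of the two-metric suggestion above (a sketch only, not
the patch): the class and method names below are hypothetical, it uses plain
JDK counters rather than the metrics library, and clamping negative elapsed
times to zero is my own assumption, not something the existing code does.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch: tracks dropped-message latency separately for
// internal and cross-node messages. Not the Cassandra implementation.
final class DroppedLatencySketch {
    private final LongAdder internalTotalMillis = new LongAdder();
    private final LongAdder internalCount = new LongAdder();
    private final LongAdder crossNodeTotalMillis = new LongAdder();
    private final LongAdder crossNodeCount = new LongAdder();

    // Records how long a dropped message waited before it was dropped.
    void recordDropped(long constructionTimeMillis, long nowMillis, boolean crossNode) {
        // Assumption: treat a negative difference as zero elapsed time.
        long elapsed = Math.max(0L, nowMillis - constructionTimeMillis);
        if (crossNode) {
            crossNodeTotalMillis.add(elapsed);
            crossNodeCount.increment();
        } else {
            internalTotalMillis.add(elapsed);
            internalCount.increment();
        }
    }

    // Mean latency of dropped internal messages, or 0.0 if none were recorded.
    double meanInternalMillis() {
        long count = internalCount.sum();
        return count == 0 ? 0.0 : (double) internalTotalMillis.sum() / count;
    }

    // Mean latency of dropped cross-node messages, or 0.0 if none were recorded.
    double meanCrossNodeMillis() {
        long count = crossNodeCount.sum();
        return count == 0 ? 0.0 : (double) crossNodeTotalMillis.sum() / count;
    }
}
```

A caller would pass {{System.currentTimeMillis()}} as {{nowMillis}} at the
point where the message is dropped.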
Answering your questions:
bq. Also, a question: It appears that Timer.Update appends entries to the
metric (which is what we want). Do you know at what point it starts dropping
new appends / starts giving up ? I wonder if there is a huge number of dropped
mutations will the timeTaken metric mess up ?
I think the metrics package already handles that. I think {{Timer}} metrics
keep running averages rather than the actual values, so no cleanup is needed
afaik.
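To show why a running average needs no cleanup, here is a minimal sketch of an
exponentially weighted moving average. It illustrates the general idea only;
the class name and the {{alpha}} parameter are hypothetical, and the metrics
library's actual internals may well differ.

```java
// Hypothetical sketch: a running average whose state stays the same size
// no matter how many values are recorded. Not the metrics library's code.
final class RunningAverageSketch {
    private final double alpha; // weight given to each new value, in (0, 1]
    private double average;
    private long count;

    RunningAverageSketch(double alpha) {
        if (!(alpha > 0.0 && alpha <= 1.0)) {
            throw new IllegalArgumentException("alpha must be in (0, 1]");
        }
        this.alpha = alpha;
    }

    // Folds one new value into the average; nothing is appended or stored.
    synchronized void update(long valueMillis) {
        if (count == 0) {
            average = valueMillis;
        } else {
            average += alpha * (valueMillis - average);
        }
        count++;
    }

    synchronized double average() {
        return average;
    }

    synchronized long count() {
        return count;
    }
}
```

Each update overwrites two fields, so a burst of dropped mutations changes the
reported value but not the memory used.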
bq. To make this work for CF, I will probably pass the mutation to
MessagingService.LogDroppedMessages (maybe through an overload) and update the
metrics on appropriate CF. Does that make sense ?
Sounds good.
bq. If this change looks good, I am more inclined towards making this work for
CF before making up patches for old branches. Let me know if that's okay.
Sure. Watch out for additional details with CF metrics, such as cleaning up
the metrics if the CF is dropped. You'll probably want to integrate this with
the {{TableMetrics}} class.
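The cleanup concern could look roughly like the sketch below. It is a
standalone illustration with hypothetical names ({{PerTableDroppedSketch}},
{{markDropped}}, {{release}}); it does not use or mirror the {{TableMetrics}}
API, which is where the real change would belong.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch: per-table dropped-mutation counters that are
// removed when the table is dropped. Not the TableMetrics API.
final class PerTableDroppedSketch {
    private final ConcurrentMap<String, LongAdder> droppedByTable = new ConcurrentHashMap<>();

    // Counts one dropped mutation against the given table.
    void markDropped(String table) {
        droppedByTable.computeIfAbsent(table, t -> new LongAdder()).increment();
    }

    // Returns the number of dropped mutations recorded for the table.
    long dropped(String table) {
        LongAdder counter = droppedByTable.get(table);
        return counter == null ? 0L : counter.sum();
    }

    // Call when the table is dropped so its counter does not linger.
    void release(String table) {
        droppedByTable.remove(table);
    }

    // Number of tables that currently have a counter.
    int trackedTables() {
        return droppedByTable.size();
    }
}
```

Without the {{release}} step, counters for dropped tables would stay
registered for the lifetime of the process.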
> On dropped mutations, more details should be logged.
> ----------------------------------------------------
>
> Key: CASSANDRA-10580
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10580
> Project: Cassandra
> Issue Type: Improvement
> Components: Coordination
> Environment: Production
> Reporter: Anubhav Kale
> Assignee: Anubhav Kale
> Priority: Minor
> Fix For: 3.2, 2.2.x
>
> Attachments: 10580-Metrics.patch, 10580.patch,
> CASSANDRA-10580-Head.patch, Trunk.patch
>
>
> In our production cluster, we are seeing a large number of dropped mutations.
> At a minimum, we should print the time the thread took to get scheduled
> thereby dropping the mutation (We should also print the Message / Mutation so
> it helps in figuring out which column family was affected). This will help
> find the right tuning parameter for write_timeout_in_ms.
> The change is small and is in StorageProxy.java and MessagingTask.java. I
> will submit a patch shortly.