[ https://issues.apache.org/jira/browse/CASSANDRA-10580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053675#comment-15053675 ]
Paulo Motta commented on CASSANDRA-10580:
-----------------------------------------
While re-reviewing your patch I just noticed that we already log dropped
messages in {{MessagingService.logDroppedMetrics()}}. We used to log dropped
messages individually, but that became very verbose, so we started logging a
summary every minute instead (more details on CASSANDRA-1284). Sorry for not
checking this before.
I think a more robust/elegant approach is to provide a new {{Timer}} metric
({{droppedTime}} or {{timeTaken}}) on {{DroppedMessageMetrics}}, and print the
average dropped time in {{MessagingService.logDroppedMetrics()}}. One benefit
of this approach is that the metric can automatically be plotted and consumed
in real time via JMX. Another, more aesthetic, benefit is that we would not
need to repeat the logging logic in {{MessageDeliveryTask}},
{{LocalMutationRunnable}} and {{DroppableRunnable}}, since they already report
statistics to {{MessagingService}} via {{incrementDroppedMessages()}}.
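
To make the idea concrete, here is a rough sketch of the bookkeeping such a
metric would do. It is not the actual {{DroppedMessageMetrics}} code: the class
and method names ({{DroppedTimeSketch}}, {{recordDropped}}, {{summaryLine}}) are
made up for illustration, and a real patch would use the metrics library's
{{Timer}} rather than hand-rolled counters.

```java
import java.util.Locale;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-in for a Timer-style metric; not Cassandra's real class.
final class DroppedTimeSketch
{
    private final LongAdder dropped = new LongAdder();
    private final LongAdder totalNanos = new LongAdder();

    // Called once per dropped message with the time it waited before the drop.
    void recordDropped(long timeTaken, TimeUnit unit)
    {
        dropped.increment();
        totalNanos.add(unit.toNanos(timeTaken));
    }

    long droppedCount()
    {
        return dropped.sum();
    }

    // Mean time taken by dropped messages, in milliseconds (0 if none dropped).
    double meanDroppedMillis()
    {
        long n = dropped.sum();
        return n == 0 ? 0.0 : (totalNanos.sum() / (double) n) / 1_000_000.0;
    }

    // One summary line per verb, in the spirit of the periodic summary log.
    String summaryLine(String verb)
    {
        return String.format(Locale.ROOT,
                             "%s messages dropped: %d, mean time taken: %.1f ms",
                             verb, droppedCount(), meanDroppedMillis());
    }
}
```

With something like this, each of the three runnables only has to pass the
elapsed time along with its existing dropped-message report, and the summary
logger prints the mean in one place.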
In order to provide dropped mutation metrics per KS/Table we would need to add
a new counter metric {{droppedMutations}} to {{TableMetrics}}. This will be a
bit more complex but still doable, so we can leave it for another ticket if you
don't want to do it now. If you're not familiar with the metrics system, you
can have a look at the classes whose names end in {{Metrics}} for more
background.
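
As a rough illustration of the per-table counter, the sketch below keeps one
counter per keyspace/table pair. Again, this is only an assumption about the
shape of the solution: {{DroppedMutationsByTableSketch}} and its methods are
hypothetical names, and in the real code the counter would live on each
table's {{TableMetrics}} instead of in a shared map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical per-table dropped-mutation counters; not Cassandra's TableMetrics.
final class DroppedMutationsByTableSketch
{
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    private static String key(String keyspace, String table)
    {
        return keyspace + "." + table;
    }

    // Called when a mutation touching the given table is dropped.
    void markDropped(String keyspace, String table)
    {
        counters.computeIfAbsent(key(keyspace, table), k -> new LongAdder()).increment();
    }

    // Number of dropped mutations recorded for the table (0 if never seen).
    long dropped(String keyspace, String table)
    {
        LongAdder counter = counters.get(key(keyspace, table));
        return counter == null ? 0L : counter.sum();
    }
}
```

A mutation that spans several tables would call {{markDropped}} once per
affected table, which is part of what makes this a bit more work than the
global metric.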
Please let me know if you need some help with this approach.
> On dropped mutations, more details should be logged.
> ----------------------------------------------------
>
> Key: CASSANDRA-10580
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10580
> Project: Cassandra
> Issue Type: Improvement
> Components: Coordination
> Environment: Production
> Reporter: Anubhav Kale
> Assignee: Anubhav Kale
> Priority: Minor
> Fix For: 3.2, 2.2.x
>
> Attachments: 10580.patch, CASSANDRA-10580-Head.patch, Trunk.patch
>
>
> In our production cluster, we are seeing a large number of dropped mutations.
> At a minimum, we should print the time the thread took to get scheduled
> thereby dropping the mutation (We should also print the Message / Mutation so
> it helps in figuring out which column family was affected). This will help
> find the right tuning parameter for write_timeout_in_ms.
> The change is small and is in StorageProxy.java and MessagingTask.java. I
> will submit a patch shortly.