[
https://issues.apache.org/jira/browse/CASSANDRA-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13042175#comment-13042175
]
Sylvain Lebresne commented on CASSANDRA-2405:
---------------------------------------------
This needs rebasing. First, a few small remarks:
* It seems we store the time in microseconds, but then, when computing the
time since last repair, we use System.currentTimeMillis() - stored_time, which
mixes milliseconds with microseconds.
* I would be in favor of calling the system table REPAIR_INFO, because the
truth is I think it would make sense to record a number of other statistics on
repair and it doesn't hurt to make the system table less specific. That also
means we should probably not force any type for the value (though that can be
easily changed later, so it's not a big deal for this patch).
* I think we usually put the code to query the system table in SystemTable,
so I would move it from AntiEntropy to there.
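To illustrate the unit mismatch from the first remark, here is a minimal
sketch. The class and method names (RepairAge, millisSinceRepair) are
hypothetical and not taken from the patch; the only assumption is that the
stored value is microseconds since the epoch.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not the patch's code: shows the unit conversion only.
class RepairAge {
    // storedMicros: timestamp as persisted, assumed to be microseconds since epoch.
    // nowMillis: current time in milliseconds, e.g. System.currentTimeMillis().
    static long millisSinceRepair(long storedMicros, long nowMillis) {
        // Convert the stored value to milliseconds before subtracting.
        return nowMillis - TimeUnit.MICROSECONDS.toMillis(storedMicros);
    }

    public static void main(String[] args) {
        long nowMillis = System.currentTimeMillis();
        // Pretend the last repair was recorded one minute ago, in microseconds.
        long storedMicros = TimeUnit.MILLISECONDS.toMicros(nowMillis - 60_000);

        // Mixed units: milliseconds minus microseconds gives a large negative number.
        System.out.println(nowMillis - storedMicros);
        // Consistent units: prints 60000.
        System.out.println(millisSinceRepair(storedMicros, nowMillis));
    }
}
```

Storing and reading the value in the same unit would avoid the conversion
entirely; the sketch only shows what goes wrong when the two differ.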
Then more generally, a given repair involves multiple states and multiple
nodes, so I don't think keeping only one timestamp is enough. Right now, we
save the time of the last scheduled validation compaction on each node. With
only that, we're missing information people need to make any reasonably
informed decision:
* First, this does not correspond to the last repair session started on
that node, since the validation can be a request from another node. People may
be interested in that information.
* Second, given that repair concerns a given range, keeping only one
general number is wrong (it would suggest the node has been repaired recently
even when only one range out of 3 or 5 has actually been repaired).
* Third, though recording the start of the validation compaction is
important, this says nothing about the success of the repair (and we all know
failures during repair do happen, if only because it's a fairly long operation
during which a node can die). So we need to record some info on the success of
the operation if we don't want to return misleading information. Turns out,
this is easy to record on the node coordinating the repair, maybe not so much
on the other nodes participating in the repair.
Truth is, I'm not so sure what the simplest way to handle this is. Maybe one
option could be to only register the start and end time of a repair session on
the coordinator of the repair (adding the info of which range was repaired).
Also, what do people think of keeping a history (instead of just keeping the
last number)? I'm thinking a little bit ahead here, but what about storing one
supercolumn per repair, where the supercolumn name would be the repair session
id (a TimeUUID really) and the columns hold info on that repair? For this patch we
would only record the range for that session, the start time and the end time
(or maybe one end time for each node). But we would populate this a little bit
further with stuff like CASSANDRA-2698. I think having such history would be
fairly interesting.
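
As a rough sketch of the history idea above, the following in-memory model
keeps one entry per repair session with its range, start time and end time,
and derives "time since last successful repair" per range. Everything here is
hypothetical (RepairHistory, Session, the String range representation); it
does not touch the system table, and UUID.randomUUID() is only a stand-in for
a TimeUUID session id.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.OptionalLong;
import java.util.UUID;

// Hypothetical model of a per-session repair history kept on the coordinator.
class RepairHistory {
    // One entry per repair session; endMillis < 0 means no recorded completion.
    static final class Session {
        final UUID sessionId;   // stand-in for a TimeUUID
        final String range;     // the token range covered by this session
        final long startMillis;
        long endMillis = -1;

        Session(UUID sessionId, String range, long startMillis) {
            this.sessionId = sessionId;
            this.range = range;
            this.startMillis = startMillis;
        }
    }

    private final List<Session> sessions = new ArrayList<>();

    // Record the start of a repair session for the given range.
    Session start(String range, long nowMillis) {
        Session s = new Session(UUID.randomUUID(), range, nowMillis);
        sessions.add(s);
        return s;
    }

    // Record successful completion; a failed session simply never gets an end time.
    void finish(Session s, long nowMillis) {
        s.endMillis = nowMillis;
    }

    // Time since the most recent completed repair of this range, if there is one.
    OptionalLong millisSinceLastSuccess(String range, long nowMillis) {
        long latestEnd = -1;
        for (Session s : sessions) {
            if (s.range.equals(range) && s.endMillis > latestEnd) {
                latestEnd = s.endMillis;
            }
        }
        return latestEnd < 0 ? OptionalLong.empty() : OptionalLong.of(nowMillis - latestEnd);
    }
}
```

Keying the answer by range means a node that has repaired only one of its
ranges does not look fully repaired, and a session that started but never
finished does not count as a success.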
> should expose 'time since last successful repair' for easier aes monitoring
> ---------------------------------------------------------------------------
>
> Key: CASSANDRA-2405
> URL: https://issues.apache.org/jira/browse/CASSANDRA-2405
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Peter Schuller
> Assignee: Pavel Yaskevich
> Priority: Minor
> Fix For: 0.8.1
>
> Attachments: CASSANDRA-2405-v2.patch, CASSANDRA-2405.patch
>
>
> The practical implementation issues of actually ensuring repair runs is
> somewhat of an undocumented/untreated issue.
> One hopefully low hanging fruit would be to at least expose the time since
> last successful repair for a particular column family, to make it easier to
> write a correct script to monitor for lack of repair in a non-buggy fashion.