[jira] [Commented] (CASSANDRA-2405) should expose 'time since last successful repair' for easier aes monitoring

Sylvain Lebresne (JIRA) Wed, 01 Jun 2011 06:54:36 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13042175#comment-13042175
 ]


Sylvain Lebresne commented on CASSANDRA-2405:
---------------------------------------------

This needs rebasing. First, three small remarks:
  * It seems we store the time in microseconds but then, when computing the 
time since last repair, we use System.currentTimeMillis() - stored_time, which 
mixes milliseconds with microseconds.
  * I would be in favor of calling the system table REPAIR_INFO, because the 
truth is I think it would make sense to record a number of other statistics on 
repair and it doesn't hurt to make the system table less specific. That also 
means we should probably not force any type for the value (though that can be 
easily changed later, so it's not a big deal for this patch).
  * I think we usually put the code to query the system table in SystemTable, 
so I would move it from AntiEntropy to there.
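
To illustrate the unit mismatch in the first remark, here is a minimal sketch, 
not the patch's code. The class and method names are hypothetical, and it 
assumes the stored value is microseconds since the epoch; the stored value is 
converted to milliseconds before subtracting:

```java
import java.util.concurrent.TimeUnit;

class RepairAge {
    // Hypothetical helper: storedMicros is assumed to be microseconds since the epoch.
    static long millisSinceRepair(long storedMicros, long nowMillis) {
        // Convert to the same unit as nowMillis before subtracting.
        long storedMillis = TimeUnit.MICROSECONDS.toMillis(storedMicros);
        return nowMillis - storedMillis;
    }

    public static void main(String[] args) {
        long nowMillis = System.currentTimeMillis();
        // Pretend the last repair was recorded one minute ago, in microseconds.
        long storedMicros = TimeUnit.MILLISECONDS.toMicros(nowMillis - 60_000L);
        System.out.println(millisSinceRepair(storedMicros, nowMillis));
    }
}
```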

Then more generally, a given repair involves multiple states and multiple 
nodes, so I don't think keeping only one timestamp is enough. Right now, we 
save the time of the last scheduled validation compaction on each node. With 
only that, we're missing the information people need to make any reasonably 
informed decision:
    * First, this does not correspond to the last repair session started on 
that node, since the validation can be a request from another node. People may 
be interested in that information.
    * Second, given that repair concerns a given range, keeping only one 
general number is wrong (it would suggest the node has been repaired recently 
even when only one range out of 3 or 5 has actually been repaired).
    * Third, though recording the start of the validation compaction is 
important, this says nothing about the success of the repair (and we all know 
failures during repair do happen, if only because it's a fairly long operation 
during which nodes can die). So we need to record some info on the success of 
the operation if we don't want to return misleading information. Turns out, 
this is easy to record on the node coordinating the repair, maybe not so much 
on the other nodes participating in the repair.

Truth is, I'm not so sure what the simplest way to handle this is. Maybe one 
option could be to only register the start and end time of a repair session on 
the coordinator of the repair (adding the info of which range was repaired).
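
As a rough sketch of that option (in-memory only, with hypothetical names, and 
with ranges represented as plain strings for illustration), the coordinator 
could track the last successful end time per range, so that a failed or 
unfinished session never counts as a repair:

```java
import java.util.HashMap;
import java.util.Map;

class RepairTracker {
    // Hypothetical representation: a range is identified by a string such as "(0,100]".
    private final Map<String, Long> inProgressStart = new HashMap<>();
    private final Map<String, Long> lastSuccessMillis = new HashMap<>();

    void sessionStarted(String range, long nowMillis) {
        inProgressStart.put(range, nowMillis);
    }

    // Only a session that was started and then completed updates the success time.
    void sessionSucceeded(String range, long nowMillis) {
        if (inProgressStart.remove(range) != null) {
            lastSuccessMillis.put(range, nowMillis);
        }
    }

    // A failed session is forgotten; the previous success time, if any, is kept.
    void sessionFailed(String range) {
        inProgressStart.remove(range);
    }

    // Returns -1 if the range has never been successfully repaired.
    long millisSinceLastSuccess(String range, long nowMillis) {
        Long last = lastSuccessMillis.get(range);
        return last == null ? -1L : nowMillis - last;
    }
}
```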

Also, what do people think of keeping a history (instead of just keeping the 
last number)? I'm thinking a little bit ahead here, but what about storing one 
supercolumn per repair, where the supercolumn name would be the repair session 
id (a TimeUUID really) and the columns hold info on that repair? For this patch we 
would only record the range for that session, the start time and the end time 
(or maybe one end time for each node). But we would populate this a little bit 
further with stuff like CASSANDRA-2698. I think having such a history would be 
fairly interesting.
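
To make the shape of that history concrete, here is a minimal in-memory sketch. 
It is not Cassandra code: the class, method and column names are hypothetical. 
The outer map key stands in for the supercolumn name (the session id) and the 
inner map for its columns. Any java.util.UUID is accepted as the key here; 
producing an actual TimeUUID is left out:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

class RepairHistory {
    // Outer key: repair session id (stand-in for the supercolumn name).
    // Inner map: column name -> value for that session.
    private final Map<UUID, Map<String, String>> sessions = new LinkedHashMap<>();

    void recordStart(UUID sessionId, String range, long startMillis) {
        Map<String, String> columns = new LinkedHashMap<>();
        columns.put("range", range);
        columns.put("start_time", Long.toString(startMillis));
        sessions.put(sessionId, columns);
    }

    // Ignored if the session was never recorded as started.
    void recordEnd(UUID sessionId, long endMillis) {
        Map<String, String> columns = sessions.get(sessionId);
        if (columns != null) {
            columns.put("end_time", Long.toString(endMillis));
        }
    }

    // A session counts as finished only once an end time has been written.
    boolean isFinished(UUID sessionId) {
        Map<String, String> columns = sessions.get(sessionId);
        return columns != null && columns.containsKey("end_time");
    }

    Map<String, String> columnsFor(UUID sessionId) {
        return sessions.get(sessionId);
    }
}
```

Sessions that have a start time but no end time are then easy to spot as 
failed or still running.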


> should expose 'time since last successful repair' for easier aes monitoring
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2405
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2405
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Peter Schuller
>            Assignee: Pavel Yaskevich
>            Priority: Minor
>             Fix For: 0.8.1
>
>         Attachments: CASSANDRA-2405-v2.patch, CASSANDRA-2405.patch
>
>
> The practical implementation issues of actually ensuring repair runs is 
> somewhat of an undocumented/untreated issue.
> One hopefully low hanging fruit would be to at least expose the time since 
> last successful repair for a particular column family, to make it easier to 
> write a correct script to monitor for lack of repair in a non-buggy fashion.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
