Hi Jimmy,

For more insight into the hint system, these two blog posts are great resources: http://www.datastax.com/dev/blog/modern-hinted-handoff and http://www.datastax.com/dev/blog/whats-coming-to-cassandra-in-3-0-improved-hint-storage-and-delivery.
For timeframes, that's going to differ based on your read/write patterns and load. Although I haven't tried this before, I believe you can query the system.hints table to see the status of hints queued by the local machine.

--local and --dc are similar in the sense that they are always repairs against the local datacenter; they just differ in syntax. If you sustain a loss of inter-DC connectivity for longer than max_hint_window_in_ms, you'll want to run a cross-DC repair, which is just the standard full repair (without specifying either).

On Mon, Feb 29, 2016 at 7:38 PM, Jimmy Lin <y2klyf+w...@gmail.com> wrote:

> hi Bryan,
> I guess I want to find out if there is any way to tell when data will
> become consistent again in both cases.
>
> If the node is down for shorter than the max_hint_window (say 2 hours out
> of a 3-hour max), is there any way to check the log or JMX etc. to see if
> the hint queue size is back to zero or a lower range?
>
> If the node goes down for longer than the max_hint_window time (say 4
> hours vs. our max of 3 hours), we run a repair job. What is the correct
> nodetool repair syntax to use?
> In particular, what is the difference between -local vs -dc? They both
> seem to indicate repairing nodes within a datacenter, but for a cross-DC
> network outage, we want to repair nodes across DCs, right?
>
> thanks
>
> On Fri, Feb 26, 2016 at 3:38 PM, Bryan Cheng <br...@blockcypher.com>
> wrote:
>
>> Hi Jimmy,
>>
>> If you sustain a long downtime, repair is almost always the way to go.
>>
>> It seems like you're asking to what extent a cluster is able to
>> recover/resync a downed peer.
>>
>> A peer will not attempt to reacquire all the data it has missed while
>> being down. Recovery happens in a few ways:
>>
>> 1) Hints: Assuming that there are enough peers to satisfy your quorum
>> requirements on write, the live peers will queue up these operations for
>> up to max_hint_window_in_ms (from cassandra.yaml). These hints will be
>> delivered once the peer recovers.
>> 2) Read repair: There is a probability that read repair will happen,
>> meaning that a query will trigger data consistency checks and updates
>> _on the query being performed_.
>> 3) Repair.
>>
>> If a machine goes down for longer than max_hint_window_in_ms, AFAIK you
>> _will_ have missing data. If you cannot tolerate this situation, you
>> need to take a look at your tunable consistency and/or trigger a repair.
>>
>> On Thu, Feb 25, 2016 at 7:26 PM, Jimmy Lin <y2klyf+w...@gmail.com> wrote:
>>
>>> So far they are not long, just some config changes and a restart.
>>> If it is a 2-hour downtime due to whatever reason, is a repair a better
>>> option than trying to figure out whether the replication sync finished
>>> or not?
>>>
>>> On Thu, Feb 25, 2016 at 1:09 PM, daemeon reiydelle <daeme...@gmail.com>
>>> wrote:
>>>
>>>> Hmm. What are your processes when a node comes back after "a long
>>>> offline"? Long enough to take the node offline and do a repair? Run
>>>> the risk of serving stale data? Parallel repairs?
>>>>
>>>> So, what sort of time frames are "a long time"?
>>>>
>>>> Daemeon C.M. Reiydelle
>>>> USA (+1) 415.501.0198
>>>> London (+44) (0) 20 8144 9872
>>>>
>>>> On Thu, Feb 25, 2016 at 11:36 AM, Jimmy Lin <y2k...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> What are the better ways to check the overall replication status of a
>>>>> Cassandra cluster?
>>>>>
>>>>> Within a single DC, unless a node is down for a long time, most of
>>>>> the time I feel it is pretty much a non-issue and things are
>>>>> replicated pretty fast. But when a node comes back from a long
>>>>> offline period, is there a way to check that the node has finished
>>>>> its data sync with other nodes?
>>>>>
>>>>> Now across DCs, we have frequent VPN outages (sometimes short,
>>>>> sometimes long) between DCs. I would also like to know if there is a
>>>>> way to find out how replication between DCs is catching up under
>>>>> this condition.
>>>>>
>>>>> Also, if I understand correctly, the only guaranteed way to make
>>>>> sure data is synced is to run a complete repair job; is that
>>>>> correct? I am trying to see if there is a way to "force a quick
>>>>> replication sync" between DCs after a VPN outage.
>>>>> Or maybe this is unnecessary, as Cassandra will catch up as fast as
>>>>> it can, and there is nothing else we (system admins) can do to make
>>>>> it faster or better?
>>>>>
>>>>> Sent from my iPhone
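The decision logic discussed in this thread — hints cover outages shorter than max_hint_window_in_ms, anything longer needs an anti-entropy repair — can be sketched as a minimal illustration. This is not Cassandra code; the three-hour window, the helper name, and the printed labels are assumptions chosen to match the examples in the thread:

```python
# Sketch of the recovery-path decision from the thread: outages shorter
# than max_hint_window_in_ms are covered by hinted handoff (live peers
# queue and later replay the missed writes); longer outages drop hints,
# so a repair is required. Hypothetical helper, illustrative values only.

MAX_HINT_WINDOW_MS = 3 * 60 * 60 * 1000  # 3 hours, as in the thread's example
HOUR_MS = 60 * 60 * 1000

def recovery_action(downtime_ms: int,
                    max_hint_window_ms: int = MAX_HINT_WINDOW_MS) -> str:
    """Return which mechanism is expected to make the node consistent again."""
    if downtime_ms <= max_hint_window_ms:
        # Peers held hints for the whole outage; they are replayed
        # automatically once the node comes back up.
        return "hinted handoff"
    # Writes past the hint window are missing on the recovered node until
    # a repair runs (a standard full repair for cross-DC outages, per the
    # answer above).
    return "repair"

print(recovery_action(2 * HOUR_MS))  # 2h outage, within the 3h window
print(recovery_action(4 * HOUR_MS))  # 4h outage, past the window
```

For the second case, the thread's answer amounts to running the standard full `nodetool repair` (no -local or -dc flag) so that replicas in both datacenters are compared.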