The problem is that there is a window between the GC pause ending and the notification from ZK arriving. During this time a regionserver could do things it should not. That is the core of this issue.
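
To make the window concrete, here is a rough, self-contained simulation. It does not use the real ZooKeeper client or any HBase code. `FakeCoordinator`, `FakeRegionServer`, and the timings are all made up for illustration. The only point is that a server which checks its *local* view of the session will still accept work after the lease has already expired on the coordinator side, until the expiry notification is actually delivered.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCoordinator:
    """Hypothetical stand-in for ZK: a session lease renewed by heartbeats."""
    session_timeout: float
    last_heartbeat: float = 0.0

    def expired(self, now: float) -> bool:
        # The lease is gone once no heartbeat has arrived within the timeout.
        return now - self.last_heartbeat > self.session_timeout


@dataclass
class FakeRegionServer:
    """Hypothetical region server that trusts only its local session flag."""
    coordinator: FakeCoordinator
    believes_alive: bool = True  # flipped only when a notification is delivered
    writes: list = field(default_factory=list)

    def deliver_expiry_notification(self) -> None:
        self.believes_alive = False

    def write(self, now: float, edit: str) -> bool:
        # Only the local flag is consulted, so a stale flag lets the write through.
        if not self.believes_alive:
            return False
        stale = self.coordinator.expired(now)  # recorded for inspection only
        self.writes.append((now, edit, stale))
        return True


def simulate():
    zk = FakeCoordinator(session_timeout=10.0)
    rs = FakeRegionServer(zk)

    rs.write(1.0, "edit-before-pause")  # lease still valid

    # Assumed GC pause from t=2 to t=60: no heartbeats are sent, so the
    # lease expired long ago and the master may already have reassigned regions.
    in_window = rs.write(60.0, "edit-in-window")  # notification not delivered yet

    rs.deliver_expiry_notification()  # the event finally reaches the server
    after = rs.write(61.0, "edit-after-notification")
    return rs, in_window, after
```

In this sketch the write at t=60 is accepted even though the coordinator already considers the session dead. That accepted-but-stale write is the window described above.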

On Mar 16, 2010 12:15 PM, "tsuna" <tsuna...@gmail.com> wrote:

On Tue, Mar 16, 2010 at 10:13 AM, Karthik Ranganathan <kranganat...@facebook.com> wrote:
> What are your thoughts?

Why not use ZooKeeper? Each RS should hold a lock in ZK while it's alive. When the RS gets suspended for an extended period of time thanks to the magic of the GC (or for some other reason FWIW), it would lose its lock, at which point the master would notice and clean up the mess. If the RS resumes, it would notice that it lost its own lock and do the right thing (commit suicide or whatever you want).

--
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com