Loved the "Juliet" terminology as well :).

@Todd: I agree we will need something like #2 or especially #3 in other places.

Looks like we have a consensus - I will update the JIRA.

Thanks
Karthik


-----Original Message-----
From: Todd Lipcon [mailto:t...@cloudera.com] 
Sent: Tuesday, March 16, 2010 10:09 PM
To: hbase-dev@hadoop.apache.org
Subject: Re: HBASE-2312 discussion

On Tue, Mar 16, 2010 at 8:59 PM, Stack <st...@duboce.net> wrote:

> On Tue, Mar 16, 2010 at 5:08 PM, Todd Lipcon <t...@cloudera.com> wrote:
> >
> > What do you think about the trick of making the RS do a ZK sync before
> > any meta op? This forces it to take at most one action after it's been
> > terminated.
> >
>
> ... where meta op is open of new WAL log?
>
> How would this work?  RS would note in ZK the name of the WAL it's
> about to open before it did it?  If the RS then does a "Juliet" --
>
[haha, love this terminology!]

> i.e. goes into a GC pause death-like coma -- on revival, it'll go to
> open the WAL but master will have already done so, and so it'll fail?
>
>
I was actually referring to the explicit sync call in ZK:
http://hadoop.apache.org/zookeeper/docs/r3.2.1/api/org/apache/zookeeper/ZooKeeper.html#sync%28java.lang.String,%20org.apache.zookeeper.AsyncCallback.VoidCallback,%20java.lang.Object%29

The javadoc isn't that clear, but the way I understand this call is that it
makes sure the client's view of the world is up-to-date with respect to the
ZK leader at the beginning of the sync call.

The "note" box at the bottom of this section also explains it pretty well:
http://hadoop.apache.org/zookeeper/docs/r3.2.2/zookeeperProgrammers.html#ch_zkGuarantees

If we insert this sync before each transition, I think we can ensure that
the region server will do at most one operation after losing its lease.
This means that the whole "chasing the log" thing is unnecessary.
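
To make the "at most one operation" argument concrete, here is a minimal
sketch in plain Java. It does not use the real ZooKeeper client: the
FakeCoordinator and FakeRegionServer classes, the metaOp method, and the
"roll-wal-N" names are all hypothetical stand-ins, and the blocking sync()
only models the effect of the real call, which is asynchronous and takes a
path, a VoidCallback and a context object. The model also assumes a paused
server never hears about its expired lease unless it syncs.

```java
import java.util.ArrayList;
import java.util.List;

class SyncBeforeMetaOp {

    /** Hypothetical stand-in for the ZK leader: holds the authoritative lease state. */
    static class FakeCoordinator {
        private volatile boolean leaseValidOnLeader = true;

        void expireLease() { leaseValidOnLeader = false; }

        boolean readFromLeader() { return leaseValidOnLeader; }
    }

    /** Hypothetical region server with a possibly stale local view of its lease. */
    static class FakeRegionServer {
        private final FakeCoordinator zk;
        private boolean localLeaseView = true;
        final List<String> completedOps = new ArrayList<>();

        FakeRegionServer(FakeCoordinator zk) { this.zk = zk; }

        /** Models the effect of a ZK sync: catch the local view up with the leader. */
        void sync() { localLeaseView = zk.readFromLeader(); }

        /**
         * Runs one meta operation. The pause hook runs after the lease check
         * and before the operation, which is where a long GC pause can hide.
         * Returns false if the server saw that its lease is gone.
         */
        boolean metaOp(String name, boolean syncFirst, Runnable pause) {
            if (syncFirst) {
                sync();
            }
            if (!localLeaseView) {
                return false; // lease lost: refuse the operation
            }
            pause.run();
            completedOps.add(name);
            return true;
        }
    }

    /** Counts the operations that complete when the lease expires mid-way through the first one. */
    static int opsAfterLeaseLoss(boolean syncFirst) {
        FakeCoordinator zk = new FakeCoordinator();
        FakeRegionServer rs = new FakeRegionServer(zk);
        rs.metaOp("roll-wal-1", syncFirst, zk::expireLease); // lease expires during the pause
        rs.metaOp("roll-wal-2", syncFirst, () -> {});
        rs.metaOp("roll-wal-3", syncFirst, () -> {});
        return rs.completedOps.size();
    }

    public static void main(String[] args) {
        System.out.println("with sync: " + opsAfterLeaseLoss(true));     // prints "with sync: 1"
        System.out.println("without sync: " + opsAfterLeaseLoss(false)); // prints "without sync: 3"
    }
}
```

In this model the one operation that slips through is the one whose lease
check passed just before the pause. Every later operation syncs first, sees
the expired lease and is refused.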



> @Karthik "I am a little nervous about the master backing off on
> detecting the RS's progress - because the RS has already lost its zk
> lease."
>
> Yes.  The RS will have had its 'shut-yourself-down' flag set on
> loss-of-lease so is on its way out.  It's not going to revive, so its
> logs need recovering.
>
> @Kannan "Option #1 seems easy to reason about and simple to implement.
> Can we go ahead with that if there is no major objection?"
>
> Fine by me.
>

Fine by me as well. I think we'll need solutions like #2 or #3 in other
places, but for this one #1 seems to work (I'll continue to think about
whether there are any holes in our logic).

-Todd


-- 
Todd Lipcon
Software Engineer, Cloudera
