Re: HBASE-2312 discussion

Dhruba Borthakur Tue, 16 Mar 2010 21:40:04 -0700

I like Option 1 too. It looks clean in the sense that once the Master renames
the directory, the old region server can never write any new log files
in that directory. It is similar to the I/O fencing methods used by traditional
cluster services: http://en.wikipedia.org/wiki/Fencing_(computing).
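
To make the fencing effect concrete, here is a rough sketch. It uses
java.nio.file on a local filesystem as a stand-in for HDFS, and the class and
method names (RenameFenceSketch, fence, tryCreateLog) are made up for
illustration; it is not HBase or HDFS code, only the shape of the idea.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

// Illustration only: a local filesystem stands in for HDFS, and every name
// in this class is hypothetical.
final class RenameFenceSketch {

    // Master side: rename <logDir> to <logDir>-dead once the server is declared dead.
    static Path fence(Path logDir) {
        try {
            Path target = logDir.resolveSibling(logDir.getFileName() + "-dead");
            return Files.move(logDir, target);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Region server side: create the next log WITHOUT creating missing parents.
    // Returns false when the parent directory is gone, i.e. the server was fenced.
    static boolean tryCreateLog(Path logDir, String name) {
        try {
            Files.createFile(logDir.resolve(name));
            return true;
        } catch (NoSuchFileException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Replays the race: three logs written, master fences, stale server tries to roll.
    static boolean staleServerIsFenced() {
        try {
            Path root = Files.createTempDirectory("logs");
            Path logDir = Files.createDirectory(root.resolve("rs1"));
            for (int i = 1; i <= 3; i++) {
                tryCreateLog(logDir, "hlog." + i);
            }
            Path fenced = fence(logDir);                        // master acts first
            boolean created = tryCreateLog(logDir, "hlog.4");   // stale server wakes up
            return !created && Files.exists(fenced.resolve("hlog.3"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("stale server fenced: " + staleServerIsFenced());
    }
}
```

The part that matters is the non-recursive create. A create that silently makes
missing parent directories would bring the old path back and defeat the rename,
which is why Option 1 depends on the extra HDFS API.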


thanks,
dhruba

On Tue, Mar 16, 2010 at 6:08 PM, Todd Lipcon <t...@cloudera.com> wrote:

> On Tue, Mar 16, 2010 at 6:04 PM, Karthik Ranganathan <
> kranganat...@facebook.com> wrote:
>
> > @Stack: With the overwrite=false, I think option #2 looks fine.
> >
> > @Todd: By "bound number of log files", I meant that if the latest log
> > file is log.N, the master would try to open log.N+1, log.N+2, log.N+3 etc.
> > until one of them succeeds, and the RS cannot open more log files after
> > that one. So the master "bounds" the number of times the log file is opened.
> >
> > I am a little nervous about the master backing off on detecting the RS's
> > progress - because the RS has already lost its zk lease. Not sure that if
> > the master backs off, this will allow everything to proceed smoothly. But
> > probably calling sync() on zk makes sense. Will think about this some
> > more.
> >
> > I too like option #3 because it's a useful pattern, but it was initially
> > much easier to reason about #2. Of course #1 is the easiest either way.
> > Again, let me think about this more.
> >
> >
> What do you think about the trick of making the RS do a ZK sync before any
> meta op? This forces it to take at most one action after it's been
> terminated.
>
>
> > Thanks
> > Karthik
> >
> >
> > -----Original Message-----
> > From: Todd Lipcon [mailto:t...@cloudera.com]
> > Sent: Tuesday, March 16, 2010 3:18 PM
> > To: hbase-dev@hadoop.apache.org
> > Subject: Re: HBASE-2312 discussion
> >
> > On Tue, Mar 16, 2010 at 1:49 PM, Stack <st...@duboce.net> wrote:
> >
> > > Karthik:
> > >
> > > Thanks for looking into this.
> > >
> > > Reading over the issue, you think option #2 "not clean" before Todd
> > > proposes changing overwrite to false.  Do you still think it so?  If
> > > not, then option #2 seems straight-forward.
> > >
> > > While option #3 is more code, it's attractive in that it's a pattern we
> > > might take on to solve other filesystem transitions; e.g. recovering
> > > failed compactions.  Do you think option #3 harder to verify?  The
> > > 'chasing logs' would be hard to do up in tests.
> > >
> >
> > I think the "chasing logs" thing is actually avoidable pretty easily. I
> > commented on HBASE-2312 with thoughts there.
> >
> > Regarding option 1, I'm not entirely against the new HDFS API, so if
> > others think it's a good solution we may as well go with it (we're already
> > requiring patched HDFS for sync, so another simple patch isn't a huge
> > deal).
> >
> > Regarding option 2, not sure what you mean by "The number of log files
> > the RS can create will be bound." -- can you explain?
> >
> > Stack's point that #3 is a useful pattern for lots of transitions seems
> > very valid to me as well.
> >
> >
> > >
> > > Thanks,
> > > St.Ack
> > >
> > > P.S. Tsuna, up https://issues.apache.org/jira/browse/HBASE-2238 there
> > > is some discussion of why hdfs state changes have to be managed in the
> > > filesystem only, of how state can't bridge filesystem and zookeeper.
> > >
> > >
> > > On Tue, Mar 16, 2010 at 11:13 AM, Karthik Ranganathan
> > > <kranganat...@facebook.com> wrote:
> > > > Hey guys,
> > > >
> > > > Just wanted to close on which solution we wanted to pick for this
> > > > issue - I was thinking about working on this one. There are 3
> > > > possibilities here. I have briefly written up the issue and the three
> > > > solutions below.
> > > >
> > > > Issue:
> > > > There is a rare corner case where bad things could happen (i.e. data loss):
> > > > 1) RS #1 is going to roll its HLog - not yet created the new one, old one will get no more writes
> > > > 2) RS #1 enters GC Pause of Death
> > > > 3) Master lists HLog files of RS #1 that it has to split as RS #1 is dead, starts splitting
> > > > 4) RS #1 wakes up, creates the new HLog (previous one was rolled) and appends an edit - which is lost
> > > >
> > > > Solution 1:
> > > > 1) Master detects RS #1 is dead
> > > > 2) The master renames the /hbase/.logs/<regionserver name> directory to something else (say /hbase/.logs/<regionserver name>-dead)
> > > > 3) Add mkdir support (as opposed to mkdirs) to HDFS - so that a file create fails if the directory doesn't exist. Dhruba tells me this is very doable.
> > > > 4) RS #1 comes back up and is not able to create the new hlog. It restarts itself.
> > > > NOTE: Need another HDFS API to be supported, Todd wants to avoid this.
> > > > This API exists in Hadoop 0.21, but is not back-ported to 0.20.
> > > >
> > > > Solution 2:
> > > > 1) RS #1 has written log.1, log.2, log.3
> > > > 2) RS #1 is just about to write log.4 and enters gc pause before doing so
> > > > 3) Master detects RS #1 dead
> > > > 4) Master sees log.1, log.2, log.3. It then opens log.3 for append and also creates log.4 as a lock
> > > > 5) RS #1 wakes up and isn't allowed to write to either log.3 or log.4 since HMaster holds both.
> > > > NOTE: This changes the log file names, changes the create mode of the
> > > > log files from overwrite = true to false. Master needs to create the
> > > > last log file and open it in append mode to prevent RS from proceeding.
> > > > RS will fail if it cannot create the next log file. The number of log
> > > > files the RS can create will be bound.
> > > >
> > > > Solution 3:
> > > > 1) Write "intend to roll HLog to new file hlog.N+1" to hlog.N
> > > > 2) Open hlog.N+1 for append
> > > > 3) Write "finished rolling" to hlog.N
> > > > 4) Continue writing to hlog.N+1
> > > > NOTE: This requires new types of edits to go into the log file -
> > > > "intent to roll" and "finished roll". Master has to open the last log
> > > > file for append first. Also, master has to "chase" log files created by
> > > > the region server (please see the issue for details) as there is an
> > > > outside chance of log files rolling when the GC pause happens.
> > > >
> > > > In my opinion, from the perspective of code simplicity, I would rank
> > > > the solutions as 1 being simplest, then 2, then 3. Since 1 needs
> > > > another HDFS API, I was thinking that 2 seemed simpler to do and easier
> > > > to verify correctness.
> > > >
> > > > What are your thoughts?
> > > >
> > > > Thanks
> > > > Karthik
> > > >
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
> >
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>



-- 
Connect to me at http://www.facebook.com/dhruba
