On Tue, Mar 16, 2010 at 6:04 PM, Karthik Ranganathan <kranganat...@facebook.com> wrote:
> @Stack: With overwrite=false, I think option #2 looks fine.
>
> @Todd: By "bound number of log files", I meant that if the latest log
> file is log.N, the master would try to open log.N+1, log.N+2, log.N+3, etc.,
> until one of them succeeds, and the RS cannot open more log files after that
> one. So the master "bounds" the number of times the log file is opened.
>
> I am a little nervous about the master backing off on detecting the RS's
> progress - because the RS has already lost its zk lease. Not sure that if
> the master backs off, this will allow everything to proceed smoothly. But
> probably calling sync() on zk makes sense. Will think about this some more.
>
> I too like option #3 because it's a useful pattern, but it was initially
> much easier to reason about #2. Of course #1 is the easiest either way.
> Again, let me think about this more.
>
> What do you think about the trick of making the RS do a ZK sync before any
> meta op? This forces it to take at most one action after it's been
> terminated.
>
> Thanks
> Karthik
>
>
> -----Original Message-----
> From: Todd Lipcon [mailto:t...@cloudera.com]
> Sent: Tuesday, March 16, 2010 3:18 PM
> To: hbase-dev@hadoop.apache.org
> Subject: Re: HBASE-2312 discussion
>
> On Tue, Mar 16, 2010 at 1:49 PM, Stack <st...@duboce.net> wrote:
>
> > Karthik:
> >
> > Thanks for looking into this.
> >
> > Reading over the issue, you thought option #2 "not clean" before Todd
> > proposed changing overwrite to false. Do you still think so? If
> > not, then option #2 seems straightforward.
> >
> > While option #3 is more code, it's attractive in that it's a pattern we
> > might take on to solve other filesystem transitions; e.g. recovering
> > failed compactions. Do you think option #3 is harder to verify? The
> > "chasing logs" would be hard to do up in tests.
>
> I think the "chasing logs" thing is actually avoidable pretty easily. I
> commented on HBASE-2312 with thoughts there.
> Regarding option 1, I'm not entirely against the new HDFS API, so if others
> think it's a good solution we may as well go with it (we're already
> requiring patched HDFS for sync, so another simple patch isn't a huge
> deal).
>
> Regarding option 2, not sure what you mean by "The number of log files the
> RS can create will be bound." -- can you explain?
>
> Stack's point that #3 is a useful pattern for lots of transitions seems
> very valid to me as well.
>
> > Thanks,
> > St.Ack
> >
> > P.S. Tsuna, up https://issues.apache.org/jira/browse/HBASE-2238 there
> > is some discussion of why hdfs state changes have to be managed in the
> > filesystem only, of how state can't bridge filesystem and zookeeper.
> >
> >
> > On Tue, Mar 16, 2010 at 11:13 AM, Karthik Ranganathan
> > <kranganat...@facebook.com> wrote:
> > > Hey guys,
> > >
> > > Just wanted to close on which solution we wanted to pick for this issue -
> > > I was thinking about working on this one. There are 3 possibilities here.
> > > I have briefly written up the issue and the three solutions below.
> > >
> > > Issue:
> > > There is a rare corner case where bad things could happen (i.e. data loss):
> > > 1) RS #1 is going to roll its HLog - it has not yet created the new one;
> > > the old one will get no more writes
> > > 2) RS #1 enters GC Pause of Death
> > > 3) Master lists the HLog files of RS #1 that it has to split as RS #1 is
> > > dead, and starts splitting
> > > 4) RS #1 wakes up, creates the new HLog (the previous one was rolled) and
> > > appends an edit - which is lost
> > >
> > > Solution 1:
> > > 1) Master detects RS #1 is dead
> > > 2) The master renames the /hbase/.logs/<regionserver name> directory to
> > > something else (say /hbase/.logs/<regionserver name>-dead)
> > > 3) Add mkdir support (as opposed to mkdirs) to HDFS - so that a file
> > > create fails if the directory doesn't exist. Dhruba tells me this is very
> > > doable.
> > > 4) RS #1 comes back up and is not able to create the new hlog.
> > > It restarts itself.
> > > NOTE: Needs another HDFS API to be supported; Todd wants to avoid this.
> > > This API exists in Hadoop 0.21, but is not back-ported to 0.20.
> > >
> > > Solution 2:
> > > 1) RS #1 has written log.1, log.2, log.3
> > > 2) RS #1 is just about to write log.4 and enters a GC pause before doing so
> > > 3) Master detects RS #1 is dead
> > > 4) Master sees log.1, log.2, log.3. It then opens log.3 for append and
> > > also creates log.4 as a lock
> > > 5) RS #1 wakes up and isn't allowed to write to either log.3 or log.4,
> > > since HMaster holds both.
> > > NOTE: This changes the log file names, and changes the create mode of the
> > > log files from overwrite = true to false. The master needs to create the
> > > last log file and open it in append mode to prevent the RS from
> > > proceeding. The RS will fail if it cannot create the next log file. The
> > > number of log files the RS can create will be bound.
> > >
> > > Solution 3:
> > > 1) Write "intend to roll HLog to new file hlog.N+1" to hlog.N
> > > 2) Open hlog.N+1 for append
> > > 3) Write "finished rolling" to hlog.N
> > > 4) Continue writing to hlog.N+1
> > > NOTE: This requires new types of edits to go into the log file - "intent
> > > to roll" and "finished roll". The master has to open the last log file
> > > for append first. Also, the master has to "chase" log files created by
> > > the region server (please see the issue for details) as there is an
> > > outside chance of log files rolling when the GC pause happens.
> > >
> > > In my opinion, from the perspective of code simplicity, I would rank the
> > > solutions as 1 being simplest, then 2, then 3. Since 1 needs another HDFS
> > > API, I was thinking that 2 seemed simpler to do and easier to verify for
> > > correctness.
> > >
> > > What are your thoughts?
> > >
> > > Thanks
> > > Karthik
> > >
> > >
>
> --
> Todd Lipcon
> Software Engineer, Cloudera

--
Todd Lipcon
Software Engineer, Cloudera
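[Archive note, not part of the thread] Solutions 1 and 2 above both hinge on an atomic fail-if-exists create: the master grabs a path the region server would need next, so a fenced RS discovers it has lost ownership the moment it tries to roll. A minimal, self-contained sketch of Solution 2's fencing trick follows; it is an illustration only, not HBase code - it simulates HDFS's `create(path, overwrite=false)` with `java.nio.file.Files.createFile` (same fail-if-exists semantics) on a local temp directory, and all class/method names are made up for the example.

```java
import java.io.IOException;
import java.nio.file.*;

public class LogLockSketch {
    // Master side: after declaring the RS dead, create log.N+1 itself, so the
    // RS's own attempt to roll to log.N+1 will fail. Returns false if the RS
    // rolled first, in which case the master must "chase" to N+2, N+3, ...
    static boolean masterTakesLock(Path logDir, int nextLogNum) {
        try {
            Files.createFile(logDir.resolve("log." + nextLogNum)); // atomic fail-if-exists
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // RS side: rolling must create the next log file exclusively. If the
    // master already holds it, the RS knows it has been fenced and must abort
    // instead of appending an edit that would be lost.
    static boolean rsTriesToRoll(Path logDir, int nextLogNum) {
        try {
            Files.createFile(logDir.resolve("log." + nextLogNum));
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("hlog-sketch");
        // RS wrote log.1..log.3, then hit the GC Pause of Death before log.4.
        for (int i = 1; i <= 3; i++) Files.createFile(dir.resolve("log." + i));
        boolean masterHolds = masterTakesLock(dir, 4); // master fences log.4
        boolean rsRolled = rsTriesToRoll(dir, 4);      // RS wakes up and tries to roll
        System.out.println("master fenced: " + masterHolds + ", RS rolled: " + rsRolled);
        // prints: master fenced: true, RS rolled: false
    }
}
```

The same pattern covers Solution 1 if the exclusive create is the per-RS log *directory* (renamed away by the master) rather than the next log file.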