On Tue, Mar 16, 2010 at 1:49 PM, Stack <st...@duboce.net> wrote:
> Karthik:
>
> Thanks for looking into this.
>
> Reading over the issue, you thought option #2 "not clean" before Todd
> proposed changing overwrite to false. Do you still think so? If not,
> then option #2 seems straightforward.
>
> While option #3 is more code, it's attractive in that it's a pattern we
> might take on to solve other filesystem transitions, e.g. recovering
> failed compactions. Do you think option #3 is harder to verify? The
> 'chasing logs' would be hard to do up in tests.
>
I think the "chasing logs" thing is actually avoidable pretty easily. I
commented on HBASE-2312 with thoughts there.

Regarding option 1, I'm not entirely against the new HDFS API, so if others
think it's a good solution we may as well go with it (we're already requiring
a patched HDFS for sync, so another simple patch isn't a huge deal).

Regarding option 2, not sure what you mean by "The number of log files the RS
can create will be bound." -- can you explain?

Stack's point that #3 is a useful pattern for lots of transitions seems very
valid to me as well.

> Thanks,
> St.Ack
>
> P.S. Tsuna, up on https://issues.apache.org/jira/browse/HBASE-2238 there
> is some discussion of why hdfs state changes have to be managed in the
> filesystem only, of how state can't bridge the filesystem and zookeeper.
>
>
> On Tue, Mar 16, 2010 at 11:13 AM, Karthik Ranganathan
> <kranganat...@facebook.com> wrote:
> > Hey guys,
> >
> > Just wanted to close on which solution we wanted to pick for this issue -
> > I was thinking about working on this one. There are 3 possibilities here.
> > I have briefly written up the issue and the three solutions below.
> >
> > Issue:
> > There is a very rare corner case where bad things could happen (i.e. data loss):
> > 1) RS #1 is going to roll its HLog - it has not yet created the new one,
> >    and the old one will get no more writes
> > 2) RS #1 enters the GC Pause of Death
> > 3) Master lists the HLog files of RS #1 that it has to split since RS #1
> >    is dead, and starts splitting
> > 4) RS #1 wakes up, creates the new HLog (the previous one was rolled) and
> >    appends an edit - which is lost
> >
> > Solution 1:
> > 1) Master detects that RS #1 is dead
> > 2) The master renames the /hbase/.logs/<regionserver name> directory to
> >    something else (say /hbase/.logs/<regionserver name>-dead)
> > 3) Add mkdir support (as opposed to mkdirs) to HDFS - so that a file
> >    create fails if the directory doesn't exist. Dhruba tells me this is
> >    very doable.
> > 4) RS #1 comes back up and is not able to create the new hlog. It
> >    restarts itself.
> > NOTE: Needs another HDFS API to be supported, which Todd wants to avoid.
> > This API exists in Hadoop 0.21, but is not back-ported to 0.20.
> >
> > Solution 2:
> > 1) RS #1 has written log.1, log.2, log.3
> > 2) RS #1 is just about to write log.4 and enters the GC pause before
> >    doing so
> > 3) Master detects RS #1 is dead
> > 4) Master sees log.1, log.2, log.3. It then opens log.3 for append and
> >    also creates log.4 as a lock
> > 5) RS #1 wakes up and isn't allowed to write to either log.3 or log.4,
> >    since the HMaster holds both.
> > NOTE: This changes the log file names and changes the create mode of the
> > log files from overwrite = true to overwrite = false. The master needs to
> > create the last log file and open it in append mode to prevent the RS
> > from proceeding. The RS will fail if it cannot create the next log file.
> > The number of log files the RS can create will be bound.
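
(To make the Solution 2 mechanics concrete, here is a minimal sketch against
the stock Hadoop 0.20 FileSystem API. The "log.N" naming, the class, and the
method name are illustrative assumptions, not the actual HBase log-splitting
code; the point is only that append-opening the last log and pre-creating the
next one with overwrite = false locks the paused region server out.)

  // Sketch only: master-side "lock out the paused region server" step.
  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class LogLockSketch {
    // Called by the master once it decides the region server is dead,
    // before it starts splitting that server's logs.
    public static void lockDeadServerLogs(Configuration conf, Path rsLogDir)
        throws IOException {
      FileSystem fs = FileSystem.get(conf);

      // Find the highest existing log sequence number (assumes "log.N" names).
      long maxSeq = -1;
      for (FileStatus stat : fs.listStatus(rsLogDir)) {
        String name = stat.getPath().getName();
        if (name.startsWith("log.")) {
          maxSeq = Math.max(maxSeq, Long.parseLong(name.substring("log.".length())));
        }
      }
      if (maxSeq < 0) {
        return; // no logs to take over
      }

      // Step 4a: open the last log for append. Acquiring the HDFS lease fences
      // out the paused region server's dangling writer (this relies on the
      // append/sync-patched HDFS that HBase already requires).
      Path lastLog = new Path(rsLogDir, "log." + maxSeq);
      FSDataOutputStream lastLogHold = fs.append(lastLog);

      // Step 4b: pre-create the next log with overwrite = false. When the
      // region server wakes up and tries to create the same file (also with
      // overwrite = false, per the NOTE above), its create throws because the
      // file already exists, and the server aborts instead of appending edits
      // that would otherwise be lost.
      Path nextLog = new Path(rsLogDir, "log." + (maxSeq + 1));
      FSDataOutputStream nextLogHold = fs.create(nextLog, false);

      // ... split the logs here; only afterwards release the "locks" ...
      lastLogHold.close();
      nextLogHold.close();
      fs.delete(nextLog, false);
    }
  }
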
> > Solution 3:
> > 1) Write "intend to roll HLog to new file hlog.N+1" to hlog.N
> > 2) Open hlog.N+1 for append
> > 3) Write "finished rolling" to hlog.N
> > 4) Continue writing to hlog.N+1
> > NOTE: This requires new types of edits to go into the log file - "intent
> > to roll" and "finished roll". The master has to open the last log file
> > for append first. Also, the master has to "chase" log files created by
> > the region server (please see the issue for details), as there is an
> > outside chance of the logs rolling while the GC pause happens.
> >
> > In my opinion, from the perspective of code simplicity, I would rank the
> > solutions as 1 being the simplest, then 2, then 3. Since 1 needs another
> > HDFS API, I was thinking that 2 seemed simpler to do and easier to verify
> > correctness.
> >
> > What are your thoughts?
> >
> > Thanks
> > Karthik
> >

--
Todd Lipcon
Software Engineer, Cloudera