We're moving a discussion started in HBASE-5487 ( https://issues.apache.org/jira/browse/HBASE-5487) based on stack's suggestion.
Master5 refers to the design doc sergey posted, and hbck-master refers to the design doc that I posted. (I'd suggest refer to them by the doc name as opposed to whomever wrote it). Here's the questions I posed to sergey: --- I have a lot of questions. I'll hit the big questions first. Also would i be possible to put a version of this up as gdoc so we can point out nits and places that need minor clarification? (I have a marked up physical copy version of the doc, would be easier to provide feedback). Main Concerns: What is a failure and how do you react to failures? I think the master5 design needs to spend more effort to considering failure and recovery cases. I claim there are 4 types of responses from a networked IO operation - two states we normally deal with ack successful, ack failed (nack) and unknown due to timeout that succeeded (timeout success) and unknown due to timeout that failed (timeout failed). We have historically missed the last two timeout cases or assumed timeout means failure nack. It seems that master5 makes the same assumptions. I'm very concerned about what we need to do to invalidate information cached RS information at clients in the case of hang, and that will violate the isolation guarantees that we claim to provide. I really want a slice in-depth failure handling case analysis including a client with cached rs assignments for move and something more complicated such as split or alter. I really want more invariant specified for the FSM states. e.g. if a region is in state X, does it have a row in meta? does have data on the FS? is it open on another region? is it open on only one region? I think having 8 pages of tables at the back of the master5 doc can be more concise and precise which will help us get attempt to prove correctness. Clarification questions: 1) State update coordination. What is a "state updates from the outside" Do RS's initiate splitting on their own? Maybe a picture would help so we can figure out if it is similar or different from hbck-master's? 2) Single point of truth. What is this truth? what the user specficied actions? what the rs's are reporting? the last state we were confirmed to be at? hbck-master tries to define what single point of truth means by defining intended, current, and actual state data with durability properties on each kind. What do clients look at who modifies what? 3) Table record: "if regions is out of date, it should be closed and reopened". It is not clear in master5 how regionservers find out that they are out of date. Moreover, how do clients talking to those RS's with stale versions know they are going to the correct RS especially in the face of RS failures due to timeout? 4) region record: transition states. Shouldn't be defined as part of the region record? (This is really similar to hbck-masters current state and intended state. ) 5) Note on user operations: the forgetting thing is scary to me – in your move split example, what happens if an RS reads state that is forgotten? 6) table state machine. how do we guarantee clients are not writing to against out of date region versions? (in hang situations, regions could be open on multple places – the hung RS and the new RS the region was assigned to and successfully opened on) 7) region state machine. Earlier draft hand splitting and merge cases. Are they elided in master5 or are not present any more. How would this get extended handle jeffrey's distributed log replay/fast write recovery feature? 8) logical interactions: sounds like master5 allows concurrent operations in specfiic regions and and specfiic table. (e.g. it will allow moves and splits and merges on the same region). hbck-master (though not fully documented) only allows certain region transitions when the table is enabled or if the table is disabled. Are we sure we don't get into race conditions? What happens if disable gets issued – its possible for someone to reopens the region and for old clients to continue writing to it even though it is closed? nit. 9) "in cursive" mean in italics. 10) The table operations section have tables which I believe are the actions between FSM states in the table or region fsms. Is this correct? Can the edges be labeled to describe which steps these transitions correspond to? Short doc: nit: Design Constraints, code should: Have AM logic isolated from the persistent storage of state. // I think this should be "abstracted" so we can plug in different implementations of persistent storage of state. -- // Jonathan Hsieh (shay) // Software Engineer, Cloudera // [email protected]
