Comments inline below:

> -----Original Message-----
> From: Cosmin Lehene [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, September 09, 2008 7:25 AM
> To: [email protected]
> Subject: Re: Hbase corrupts data after reporting MSG_REPORT_CLOSE to master
> during compaction and split process
>
> Hi,
>
> I managed to reproduce the corruption and also have full debug logs, but
> first I'll explain the whys and hows of the bug and also how I think it can
> be fixed.
> (I'm going to send my takeaways on how we managed to insert 300GB in less
> than 6 hours on a 5-node cluster, and also some advice/issues, in another
> mail.)
>
> The next assumptions are based on my understanding of the actual code (don't
> worry if I didn't get them all right; please read the entire mail).
>
> - The master _assigns_ a region to a server by sending a MSG_REGION_OPEN.
> - On heartbeat, region servers report their current load and a list of MLR -
> most loaded regions (in fact just a list of the first N online regions).
> - Upon opening a newly assigned region, a region server will try to compact
> and split that region.
> - The region is NOT marked offline when compaction starts.
> - The region is marked OFFLINE:true, SPLIT:true during a SPLIT.
>
> Our scenario goes this way:
>
> Master (M) assigns region A to region server R1.
> R1 starts compaction and split of A.
> R1, on heartbeat, sends its load and a list of MLR that contains A.
This list should only be a list of open regions and should not include any
regions in the process of being opened. In addition, the region server should
attach a number of MSG_REPORT_PROCESS_OPEN messages to the heartbeat (one for
each region being opened). This should prevent the master from reassigning
those regions.

> M decides to reassign the extra regions and sends a MSG_CLOSE_REGION A to R1.
> R1 finishes the compaction and splits A into A1 and A2 (A1 has the same start
> key as A).

If, in fact, the region server is including regions that are not completely
open in the load list, this is a bug.

> M assigns A to R2.
> R2 starts compaction and split of A.
> R2 finishes the compaction and splits A into A_clone_1 and A_clone_2
> (A_clone_1 has the same start key as A and, IMPORTANT, the same start key as
> A1).

Whenever two region servers start working on the same region, chaos ensues. It
is rare that corruption *will not* happen in this case.

> Now we get A1 and A_clone_1, almost identical, starting with the same key.
> The cluster is corrupted. We should care less about what happens next, but
> the idea is that they are both in .META.
>
> I figured out several places where this could be avoided, and I'm going to
> state a few disjoint questions. Both the master and the region server could
> be held responsible, in my opinion, but I guess it's a matter of
> architectural philosophy. Please note that any of these questions would be a
> starting point for the fix.
>
> - Why, when getting a MSG_CLOSE_REGION A, doesn't the region server abort
> the current compact/split operation, leaving A in its original state, and
> close it immediately?

MSG_CLOSE_REGION is sent for various different purposes. Maybe, if the master
has timed out the region server, it should send something like MSG_ABORT_OPEN.

> - Why doesn't a region server DELETE a region after a SPLIT? (I guess it
> could be offline by then and it's not its decision to make, but still...)
The reason splits are fast is that the two children use the parent's files
until they do a compaction. Thus the parent region must remain around until
both children are no longer using it. The master then garbage collects the
parent.

> - Why, when assigning a region to a new region server, doesn't the master
> check the region's status? It might be splitting or already split. I guess
> this would need a new state.

The master does check whether a region is split or offline and will not assign
it. However, this information is only available after the split is complete.

> - Why, when opening/compacting/splitting, doesn't a region server check
> whether the region is OFFLINE:true or SPLIT:true?

A region server should never receive an open message for a split or offline
region. When the region server is told to open a region, it assumes it has
exclusive rights to all the files of the region.

> I have the logs available; however, they are pretty large and I might need
> to clean them up a little. I could make them available if that's really
> needed, but I think the scenario and questions might be enough for a bug
> report and a fix.
>
> Thanks,
> Cosmin
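To make the heartbeat fix discussed above concrete, here is a minimal Java sketch of a load report that excludes regions still being opened and attaches one MSG_REPORT_PROCESS_OPEN per in-flight open. All class and method names here are illustrative assumptions, not the actual HBase API; only the message name comes from the discussion.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: the region server reports load from fully open regions
// and flags half-opened regions separately, so the master never treats
// a region that is still being opened as reassignable.
class HeartbeatSketch {

    enum RegionState { OPENING, OPEN }

    static class Region {
        final String name;
        final RegionState state;
        Region(String name, RegionState state) {
            this.name = name;
            this.state = state;
        }
    }

    // The load list sent on heartbeat: only completely open regions.
    static List<String> loadReport(List<Region> regions) {
        List<String> open = new ArrayList<String>();
        for (Region r : regions) {
            if (r.state == RegionState.OPEN) {
                open.add(r.name);
            }
        }
        return open;
    }

    // One MSG_REPORT_PROCESS_OPEN per region still being opened, telling
    // the master "this region is taken; do not reassign it".
    static List<String> processOpenMessages(List<Region> regions) {
        List<String> msgs = new ArrayList<String>();
        for (Region r : regions) {
            if (r.state == RegionState.OPENING) {
                msgs.add("MSG_REPORT_PROCESS_OPEN " + r.name);
            }
        }
        return msgs;
    }
}
```

Under this scheme, region A from the scenario above would appear in the MSG_REPORT_PROCESS_OPEN list rather than the MLR list while R1 is still compacting and splitting it, so the master would have no reason to send the reassignment that started the corruption.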

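The master-side guard mentioned in the replies (a split or offline region is never assigned) can be sketched as follows. The class and field names are hypothetical; the OFFLINE:true / SPLIT:true flags are the ones from the .META. rows described above. Note the limitation stated in the replies also holds here: the flags only exist once the split has been recorded, so this check alone cannot catch a region that is mid-split.

```java
// Sketch of the master-side assignment guard: a region whose .META. row
// carries OFFLINE:true or SPLIT:true is skipped during assignment.
// Illustrative names, not actual HBase code.
class AssignmentGuardSketch {

    static class MetaRow {
        final String regionName;
        final boolean offline; // OFFLINE:true in .META.
        final boolean split;   // SPLIT:true once the parent has been split

        MetaRow(String regionName, boolean offline, boolean split) {
            this.regionName = regionName;
            this.offline = offline;
            this.split = split;
        }
    }

    // The master only hands out regions that are neither offline nor split.
    static boolean isAssignable(MetaRow row) {
        return !row.offline && !row.split;
    }
}
```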