File a blocker please, Lars. I'm pretty sure the boolean that tracks whether we
are doing a recovery or not has been there a long time, so yes, a single server
recovery could throw us off. But you make a practical point: one server should
not destroy locality across the whole cluster.
St.Ack

On Tue, Mar 14, 2017 at 7:09 AM, Lars George <[email protected]> wrote:
> Wait, HBASE-15251 is not enough methinks. The checks added help, but
> are not covering all the possible edge cases. In particular, say a
> node really fails: why not just reassign the few regions it did hold
> and leave all the others where they are? Seems insane as it is.
>
> On Tue, Mar 14, 2017 at 2:24 PM, Lars George <[email protected]> wrote:
> > Doh, https://issues.apache.org/jira/browse/HBASE-15251 addresses this
> > (though I am not sure exactly how, see below). This should be
> > backported to all 1.x branches!
> >
> > As for the patch, I see this:
> >
> >     if (!failover) {
> >       // Fresh cluster startup.
> >   -   LOG.info("Clean cluster startup. Assigning user regions");
> >   +   LOG.info("Clean cluster startup. Don't reassign user regions");
> >       assignAllUserRegions(allRegions);
> >   + } else {
> >   +   LOG.info("Failover! Reassign user regions");
> >     }
> >
> > Would be interesting to see how that log message is actually
> > reassigning the user regions. I would have assumed that
> > "assignAllUserRegions()" would have to be called.
> >
> > On Tue, Mar 14, 2017 at 2:21 PM, Lars George <[email protected]> wrote:
> >> Looking at the code more... it seems the issue is here, in
> >> AssignmentManager.processDeadServersAndRegionsInTransition():
> >>
> >>     ...
> >>     failoverCleanupDone();
> >>     if (!failover) {
> >>       // Fresh cluster startup.
> >>       LOG.info("Clean cluster startup. Assigning user regions");
> >>       assignAllUserRegions(allRegions);
> >>     }
> >>     ...
> >>
> >> As soon as a single node has failed, failover is set to true, so no
> >> assignment is done here. Later on, the servers are recovered from the
> >> "crash" (which a clean restart is as well), and then all unassigned
> >> regions (all of them now, since they were unassigned earlier) are
> >> assigned round-robin.
> >>
> >> On Tue, Mar 14, 2017 at 1:54 PM, Lars George <[email protected]> wrote:
> >>> Hi,
> >>>
> >>> I had this happen at multiple clusters recently where, after a
> >>> restart, the locality dropped from close to (or exactly) 100% down
> >>> to single digits. The reason is that all regions were completely
> >>> shuffled and reassigned to random servers. Upon reading the (yet
> >>> again non-trivial) assignment code, I found that a single missing
> >>> server will trigger a full "recovery" of all servers, which includes
> >>> dropping all previous assignments and assigning regions to random
> >>> new locations.
> >>>
> >>> This is just terrible! In addition, I had also assumed that - at
> >>> least the StochasticLoadBalancer - checks which node holds most of a
> >>> region's data, locality-wise, and picks that server. But that is
> >>> _not_ the case! It just spreads everything seemingly randomly across
> >>> the servers.
> >>>
> >>> To me this is a big regression (or a straight bug), given that a
> >>> single server out of, for example, hundreds could trigger this and
> >>> destroy locality completely. Running a major compaction afterwards
> >>> is not a viable fix, for many reasons.
> >>>
> >>> This used to work better, so why the regression?
> >>>
> >>> Lars
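
To illustrate the point Lars is making about retaining assignments: on a
restart where only one server is really gone, the plan could keep every region
on its previous host and round-robin only the orphaned ones. A minimal,
self-contained sketch follows; it is not the actual AssignmentManager code, and
the class, method, and server names are made up for illustration only.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustration only: "retain on restart" assignment. Regions and servers are
// plain strings here; this is the idea from the thread, not HBase API.
public class RetainAssignmentSketch {

  // Keep every region on its previous host if that host is still live;
  // only regions whose previous host is gone get spread round-robin.
  static Map<String, String> retainAssignment(
      Map<String, String> previous, Set<String> liveServers) {
    Map<String, String> plan = new LinkedHashMap<>();
    List<String> orphaned = new ArrayList<>();

    for (Map.Entry<String, String> e : previous.entrySet()) {
      if (liveServers.contains(e.getValue())) {
        plan.put(e.getKey(), e.getValue());   // host is back: keep locality
      } else {
        orphaned.add(e.getKey());             // host really gone: must move
      }
    }

    // Round-robin only the orphaned regions across the live servers.
    List<String> targets = new ArrayList<>(liveServers);
    int i = 0;
    for (String region : orphaned) {
      plan.put(region, targets.get(i++ % targets.size()));
    }
    return plan;
  }

  public static void main(String[] args) {
    Map<String, String> previous = new LinkedHashMap<>();
    previous.put("region-1", "rs1");
    previous.put("region-2", "rs1");
    previous.put("region-3", "rs2");
    previous.put("region-4", "rs3");

    // rs3 did not come back after the restart.
    Set<String> live = new TreeSet<>(List.of("rs1", "rs2", "rs4"));

    // Only region-4 moves; everything else stays where its data is local.
    System.out.println(retainAssignment(previous, live));
  }
}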

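On Lars's point about the StochasticLoadBalancer not considering locality when
placing a region: conceptually, a locality-aware placement would pick the live
server holding the largest fraction of the region's HDFS blocks. A toy sketch
of that idea is below; the names and the locality map are invented for
illustration and do not reflect the balancer's real cost-function API.

import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Illustration only: choose the target server by data locality. The locality
// table is a plain map of server -> fraction of the region's HDFS blocks
// stored on that server (0.0 to 1.0).
public class LocalityPickSketch {

  // Return the live server with the highest locality for the region, if any.
  static Optional<String> bestServerByLocality(
      Map<String, Double> localityByServer, Set<String> liveServers) {
    return localityByServer.entrySet().stream()
        .filter(e -> liveServers.contains(e.getKey()))
        .max(Map.Entry.comparingByValue())
        .map(Map.Entry::getKey);
  }

  public static void main(String[] args) {
    // Say rs2 holds 90% of the region's blocks locally, rs1 only 10%.
    Map<String, Double> locality = Map.of("rs1", 0.1, "rs2", 0.9, "rs3", 0.0);
    Set<String> live = Set.of("rs1", "rs2", "rs3");

    // Prints "Optional[rs2]": placing the region there keeps reads local.
    System.out.println(bestServerByLocality(locality, live));
  }
}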