[ https://issues.apache.org/jira/browse/HBASE-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
stack resolved HBASE-2617. -------------------------- Hadoop Flags: [Reviewed] Resolution: Fixed I applied patch. Thanks for review Jon (I changed message to say 'overloaded' again). > Load balancer falls into pathological state if one server under average - > slop; endless churn > --------------------------------------------------------------------------------------------- > > Key: HBASE-2617 > URL: https://issues.apache.org/jira/browse/HBASE-2617 > Project: HBase > Issue Type: Bug > Reporter: stack > Assignee: stack > Priority: Blocker > Fix For: 0.20.5 > > Attachments: 2617-v2.txt, 2617.txt > > > I'm looking at a 0.20.4 cluster of 80 fast machines. It runs fine for a > while and then falls over into crazy balancing churn (My view on logs is > sporadic but have a log before me where this is happening). Things I see > that seem to be of 0.20.4 particularly: > + We don't reach an equilibrium or at least it takes so long, its as though > it wasn't every going to happen > + Master log filled w/ open, close of one or two regions usually the same > ones over and over (The Regions that are candidates to close are provided by > the RS. They are ordered by hash of their name. We return the top ten from > this Set every time. So, we always close the same regions all the time even > if we just opened it) > + Often, we'll tell an RS to close a region. It will do the job. In 0.20.4 > we made it so if RS has any work at all for the master, that we return > immediately rather than wait for the reporting period to elaspse. So, on > these fast machines, it can be back near immediately if it just opened > another some other region, say. It can get assigned the region it just > closed. Seems to happen frequently enough. > For example, look at the below extract featuring a single regions life. Its > opened and closed 5 times in about 1/2 a second: > {code} > 2010-05-25 11:01:05,488 DEBUG org.apache.hadoop.hbase.master.HMaster: > Processing todo: ProcessRegionClose of > GenericMetaStore,139757491,1274779304880, false, reassign: true > 2010-05-25 11:01:05,489 INFO > org.apache.hadoop.hbase.master.ProcessRegionClose$1: region set as > unassigned: GenericMetaStore,139757491,1274779304880 > 2010-05-25 11:01:05,490 INFO org.apache.hadoop.hbase.master.RegionManager: > Assigning region GenericMetaStore,139757491,1274779304880 to > a025.example.com,60020,1274744064673 > 2010-05-25 11:01:05,510 INFO org.apache.hadoop.hbase.master.ServerManager: > Processing MSG_REPORT_OPEN: GenericMetaStore,139757491,1274779304880 from > a025.example.com,60020,1274744064673; 1 of 1 > 2010-05-25 11:01:05,510 INFO > org.apache.hadoop.hbase.master.RegionServerOperation: > GenericMetaStore,139757491,1274779304880 open on 10.209.32.189:60020 > 2010-05-25 11:01:05,511 INFO > org.apache.hadoop.hbase.master.RegionServerOperation: Updated row > GenericMetaStore,139757491,1274779304880 in region .META.,,1 with > startcode=1274744064673, server=10.209.32.189:60020 > 2010-05-25 11:01:05,548 DEBUG org.apache.hadoop.hbase.master.RegionManager: > Going to close region GenericMetaStore,139757491,1274779304880 > 2010-05-25 11:01:05,552 INFO org.apache.hadoop.hbase.master.ServerManager: > Processing MSG_REPORT_CLOSE: GenericMetaStore,139757491,1274779304880 from > a025.example.com,60020,1274744064673; 1 of 2 > 2010-05-25 11:01:05,552 DEBUG org.apache.hadoop.hbase.master.HMaster: > Processing todo: ProcessRegionClose of > GenericMetaStore,139757491,1274779304880, false, reassign: true > 2010-05-25 11:01:05,552 INFO > org.apache.hadoop.hbase.master.ProcessRegionClose$1: region set as > unassigned: GenericMetaStore,139757491,1274779304880 > 2010-05-25 11:01:05,556 INFO org.apache.hadoop.hbase.master.RegionManager: > Assigning region GenericMetaStore,139757491,1274779304880 to > a028.example.com,60020,1274747560769 > 2010-05-25 11:01:05,578 INFO org.apache.hadoop.hbase.master.ServerManager: > Processing MSG_REPORT_OPEN: GenericMetaStore,139757491,1274779304880 from > a028.example.com,60020,1274747560769; 1 of 1 > 2010-05-25 11:01:05,578 INFO > org.apache.hadoop.hbase.master.RegionServerOperation: > GenericMetaStore,139757491,1274779304880 open on 10.209.32.185:60020 > 2010-05-25 11:01:05,579 INFO > org.apache.hadoop.hbase.master.RegionServerOperation: Updated row > GenericMetaStore,139757491,1274779304880 in region .META.,,1 with > startcode=1274747560769, server=10.209.32.185:60020 > 2010-05-25 11:01:05,599 DEBUG org.apache.hadoop.hbase.master.RegionManager: > Going to close region GenericMetaStore,139757491,1274779304880 > 2010-05-25 11:01:05,605 INFO org.apache.hadoop.hbase.master.ServerManager: > Processing MSG_REPORT_CLOSE: GenericMetaStore,139757491,1274779304880 from > a028.example.com,60020,1274747560769; 1 of 2 > 2010-05-25 11:01:05,605 DEBUG org.apache.hadoop.hbase.master.HMaster: > Processing todo: ProcessRegionClose of > GenericMetaStore,139757491,1274779304880, false, reassign: true > 2010-05-25 11:01:05,606 INFO > org.apache.hadoop.hbase.master.ProcessRegionClose$1: region set as > unassigned: GenericMetaStore,139757491,1274779304880 > 2010-05-25 11:01:05,607 INFO org.apache.hadoop.hbase.master.RegionManager: > Assigning region GenericMetaStore,139757491,1274779304880 to > sjc1c104.example.com,60020,1274747062601 > 2010-05-25 11:01:05,640 INFO org.apache.hadoop.hbase.master.ServerManager: > Processing MSG_REPORT_OPEN: GenericMetaStore,139757491,1274779304880 from > sjc1c104.example.com,60020,1274747062601; 1 of 1 > 2010-05-25 11:01:05,640 INFO > org.apache.hadoop.hbase.master.RegionServerOperation: > GenericMetaStore,139757491,1274779304880 open on 10.209.42.181:60020 > 2010-05-25 11:01:05,641 INFO > org.apache.hadoop.hbase.master.RegionServerOperation: Updated row > GenericMetaStore,139757491,1274779304880 in region .META.,,1 with > startcode=1274747062601, server=10.209.42.181:60020 > 2010-05-25 11:01:05,723 DEBUG org.apache.hadoop.hbase.master.RegionManager: > Going to close region GenericMetaStore,139757491,1274779304880 > 2010-05-25 11:01:05,729 INFO org.apache.hadoop.hbase.master.ServerManager: > Processing MSG_REPORT_CLOSE: GenericMetaStore,139757491,1274779304880 from > sjc1c104.example.com,60020,1274747062601; 1 of 4 > 2010-05-25 11:01:05,729 DEBUG org.apache.hadoop.hbase.master.HMaster: > Processing todo: ProcessRegionClose of > GenericMetaStore,139757491,1274779304880, false, reassign: true > 2010-05-25 11:01:05,730 INFO > org.apache.hadoop.hbase.master.ProcessRegionClose$1: region set as > unassigned: GenericMetaStore,139757491,1274779304880 > 2010-05-25 11:01:05,731 INFO org.apache.hadoop.hbase.master.RegionManager: > Assigning region GenericMetaStore,139757491,1274779304880 to > sjc1c091.example.com,60020,1274747056415 > 2010-05-25 11:01:05,751 INFO org.apache.hadoop.hbase.master.ServerManager: > Processing MSG_REPORT_OPEN: GenericMetaStore,139757491,1274779304880 from > sjc1c091.example.com,60020,1274747056415; 1 of 1 > 2010-05-25 11:01:05,752 INFO > org.apache.hadoop.hbase.master.RegionServerOperation: > GenericMetaStore,139757491,1274779304880 open on 10.209.42.238:60020 > 2010-05-25 11:01:05,752 INFO > org.apache.hadoop.hbase.master.RegionServerOperation: Updated row > GenericMetaStore,139757491,1274779304880 in region .META.,,1 with > startcode=1274747056415, server=10.209.42.238:60020 > 2010-05-25 11:01:05,775 DEBUG org.apache.hadoop.hbase.master.RegionManager: > Going to close region GenericMetaStore,139757491,1274779304880 > 2010-05-25 11:01:05,780 INFO org.apache.hadoop.hbase.master.ServerManager: > Processing MSG_REPORT_CLOSE: GenericMetaStore,139757491,1274779304880 from > sjc1c091.example.com,60020,1274747056415; 1 of 2 > 2010-05-25 11:01:05,780 DEBUG org.apache.hadoop.hbase.master.HMaster: > Processing todo: ProcessRegionClose of > GenericMetaStore,139757491,1274779304880, false, reassign: true > 2010-05-25 11:01:05,780 INFO > org.apache.hadoop.hbase.master.ProcessRegionClose$1: region set as > unassigned: GenericMetaStore,139757491,1274779304880 > 2010-05-25 11:01:05,808 INFO org.apache.hadoop.hbase.master.RegionManager: > Assigning region GenericMetaStore,139757491,1274779304880 to > sjc1a003.example.com,60020,1274747057557 > 2010-05-25 11:01:05,828 INFO org.apache.hadoop.hbase.master.ServerManager: > Processing MSG_REPORT_OPEN: GenericMetaStore,139757491,1274779304880 from > sjc1a003.example.com,60020,1274747057557; 1 of 1 > 2010-05-25 11:01:05,828 INFO > org.apache.hadoop.hbase.master.RegionServerOperation: > GenericMetaStore,139757491,1274779304880 open on 10.209.32.148:60020 > 2010-05-25 11:01:05,829 INFO > org.apache.hadoop.hbase.master.RegionServerOperation: Updated row > GenericMetaStore,139757491,1274779304880 in region .META.,,1 with > startcode=1274747057557, server=10.209.32.148:60020 > {code} > The culprit seems to be the code that wants to bring up underloaded > regionservers up to average. That and something about the lightly loaded > servers math that is off. > I'm marking this as a blocker. I'm not sure why its not more common. There > were some issues on this cluster regards disks filling but though such an > event may have provoked the issue, we should have evened out eventually. > Making this a blocker on 0.20.5. Need to fix it for this user at least. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.