[jira] Commented: (HBASE-2617) Load balancer falls into pathological state if one server under average - slop; endless churn

HBase Review Board (JIRA) Thu, 27 May 2010 10:35:02 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872298#action_12872298
 ]


HBase Review Board commented on HBASE-2617:
-------------------------------------------

Message from: "Jonathan Gray" <[email protected]>

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/97/#review85
-----------------------------------------------------------

Ship it!


This seems to fix the problem described.  I recall one other weird issue in the 
load balancing when reviewing that code before the recent rewrite for trunk, 
but can't find it now.  The way I had read this previously was it seemed to 
almost always fall into the "don't know what to do, just assign one" branch, 
but looks better now.

Do we have a way to add additional regionservers in a cluster test but not 
during the initialization where we do it now?  That might be one way we could 
test interesting load balancing situations.

Only nitpick is the log messages used to say overloaded but now say "more than 
their fair share"... overloaded seemed descriptive enough :)

In any case, this is better than what is there now.  +1

- Jonathan


On 2010-05-27 10:13:14, stack wrote:
^bq.  
^bq.  -----------------------------------------------------------
^bq.  This is an automatically generated e-mail. To reply, visit:
^bq.  http://review.hbase.org/r/97/
^bq.  -----------------------------------------------------------
^bq.  
^bq.  (Updated 2010-05-27 10:13:14)
^bq.  
^bq.  
^bq.  Review request for hbase and Ryan Rawson.
^bq.  
^bq.  
^bq.  Summary
^bq.  -------
^bq.  
^bq.  Fix for pathological balancer.  Mostly logging changes to remove 
confusion as to what the balancer is up to but also removed a mistake where 
we'd always give a server one region ('just in case") if regions to assign 
though it might be a server that is loaded.  In clusters w/ lots of nodes with 
a few of the nodes not in-balance, the lightly loaded nodes would never get 
regions so we'd just keep reassigning and never settle.
^bq.  
^bq.  
^bq.  This addresses bug hbase-2617.
^bq.  
^bq.  
^bq.  Diffs
^bq.  -----
^bq.  
^bq.    
branches/0.20/src/java/org/apache/hadoop/hbase/master/RegionManager.java 948860 
^bq.  
^bq.  Diff: http://review.hbase.org/r/97/diff
^bq.  
^bq.  
^bq.  Testing
^bq.  -------
^bq.  
^bq.  
^bq.  Thanks,
^bq.  
^bq.  stack
^bq.  
^bq. 




> Load balancer falls into pathological state if one server under average - 
> slop; endless churn
> ---------------------------------------------------------------------------------------------
>
>                 Key: HBASE-2617
>                 URL: https://issues.apache.org/jira/browse/HBASE-2617
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>            Assignee: stack
>            Priority: Blocker
>             Fix For: 0.20.5
>
>         Attachments: 2617-v2.txt, 2617.txt
>
>
> I'm looking at a 0.20.4 cluster of 80 fast machines.  It runs fine for a 
> while and then falls over into crazy balancing churn (My view on logs is 
> sporadic but have a log before me where this is happening).  Things I see 
> that seem to be of 0.20.4 particularly:
> + We don't reach an equilibrium or at least it takes so long, its as though 
> it wasn't every going to happen
> + Master log filled w/ open, close of one or two regions  usually the same 
> ones over and over (The Regions that are candidates to close are provided by 
> the RS.  They are ordered by hash of their name.  We return the top ten from 
> this Set every time.  So, we always close the same regions all the time even 
> if we just opened it)
> + Often, we'll tell an RS to close a region.  It will do the job.  In 0.20.4 
> we made it so if RS has any work at all for the master, that we return 
> immediately rather than wait for the reporting period to elaspse.   So, on 
> these fast machines, it can be back near immediately if it just opened 
> another some other region, say.  It can get assigned the region it just 
> closed.  Seems to happen frequently enough.
> For example, look at the below extract featuring a single regions life.  Its 
> opened and closed 5 times in about 1/2 a second:
> {code}
> 2010-05-25 11:01:05,488 DEBUG org.apache.hadoop.hbase.master.HMaster: 
> Processing todo: ProcessRegionClose of 
> GenericMetaStore,139757491,1274779304880, false, reassign: true
> 2010-05-25 11:01:05,489 INFO 
> org.apache.hadoop.hbase.master.ProcessRegionClose$1: region set as 
> unassigned: GenericMetaStore,139757491,1274779304880
> 2010-05-25 11:01:05,490 INFO org.apache.hadoop.hbase.master.RegionManager: 
> Assigning region GenericMetaStore,139757491,1274779304880 to 
> a025.example.com,60020,1274744064673
> 2010-05-25 11:01:05,510 INFO org.apache.hadoop.hbase.master.ServerManager: 
> Processing MSG_REPORT_OPEN: GenericMetaStore,139757491,1274779304880 from 
> a025.example.com,60020,1274744064673; 1 of 1
> 2010-05-25 11:01:05,510 INFO 
> org.apache.hadoop.hbase.master.RegionServerOperation: 
> GenericMetaStore,139757491,1274779304880 open on 10.209.32.189:60020
> 2010-05-25 11:01:05,511 INFO 
> org.apache.hadoop.hbase.master.RegionServerOperation: Updated row 
> GenericMetaStore,139757491,1274779304880 in region .META.,,1 with 
> startcode=1274744064673, server=10.209.32.189:60020
> 2010-05-25 11:01:05,548 DEBUG org.apache.hadoop.hbase.master.RegionManager: 
> Going to close region GenericMetaStore,139757491,1274779304880
> 2010-05-25 11:01:05,552 INFO org.apache.hadoop.hbase.master.ServerManager: 
> Processing MSG_REPORT_CLOSE: GenericMetaStore,139757491,1274779304880 from 
> a025.example.com,60020,1274744064673; 1 of 2
> 2010-05-25 11:01:05,552 DEBUG org.apache.hadoop.hbase.master.HMaster: 
> Processing todo: ProcessRegionClose of 
> GenericMetaStore,139757491,1274779304880, false, reassign: true
> 2010-05-25 11:01:05,552 INFO 
> org.apache.hadoop.hbase.master.ProcessRegionClose$1: region set as 
> unassigned: GenericMetaStore,139757491,1274779304880
> 2010-05-25 11:01:05,556 INFO org.apache.hadoop.hbase.master.RegionManager: 
> Assigning region GenericMetaStore,139757491,1274779304880 to 
> a028.example.com,60020,1274747560769
> 2010-05-25 11:01:05,578 INFO org.apache.hadoop.hbase.master.ServerManager: 
> Processing MSG_REPORT_OPEN: GenericMetaStore,139757491,1274779304880 from 
> a028.example.com,60020,1274747560769; 1 of 1
> 2010-05-25 11:01:05,578 INFO 
> org.apache.hadoop.hbase.master.RegionServerOperation: 
> GenericMetaStore,139757491,1274779304880 open on 10.209.32.185:60020
> 2010-05-25 11:01:05,579 INFO 
> org.apache.hadoop.hbase.master.RegionServerOperation: Updated row 
> GenericMetaStore,139757491,1274779304880 in region .META.,,1 with 
> startcode=1274747560769, server=10.209.32.185:60020
> 2010-05-25 11:01:05,599 DEBUG org.apache.hadoop.hbase.master.RegionManager: 
> Going to close region GenericMetaStore,139757491,1274779304880
> 2010-05-25 11:01:05,605 INFO org.apache.hadoop.hbase.master.ServerManager: 
> Processing MSG_REPORT_CLOSE: GenericMetaStore,139757491,1274779304880 from 
> a028.example.com,60020,1274747560769; 1 of 2
> 2010-05-25 11:01:05,605 DEBUG org.apache.hadoop.hbase.master.HMaster: 
> Processing todo: ProcessRegionClose of 
> GenericMetaStore,139757491,1274779304880, false, reassign: true
> 2010-05-25 11:01:05,606 INFO 
> org.apache.hadoop.hbase.master.ProcessRegionClose$1: region set as 
> unassigned: GenericMetaStore,139757491,1274779304880
> 2010-05-25 11:01:05,607 INFO org.apache.hadoop.hbase.master.RegionManager: 
> Assigning region GenericMetaStore,139757491,1274779304880 to 
> sjc1c104.example.com,60020,1274747062601
> 2010-05-25 11:01:05,640 INFO org.apache.hadoop.hbase.master.ServerManager: 
> Processing MSG_REPORT_OPEN: GenericMetaStore,139757491,1274779304880 from 
> sjc1c104.example.com,60020,1274747062601; 1 of 1
> 2010-05-25 11:01:05,640 INFO 
> org.apache.hadoop.hbase.master.RegionServerOperation: 
> GenericMetaStore,139757491,1274779304880 open on 10.209.42.181:60020
> 2010-05-25 11:01:05,641 INFO 
> org.apache.hadoop.hbase.master.RegionServerOperation: Updated row 
> GenericMetaStore,139757491,1274779304880 in region .META.,,1 with 
> startcode=1274747062601, server=10.209.42.181:60020
> 2010-05-25 11:01:05,723 DEBUG org.apache.hadoop.hbase.master.RegionManager: 
> Going to close region GenericMetaStore,139757491,1274779304880
> 2010-05-25 11:01:05,729 INFO org.apache.hadoop.hbase.master.ServerManager: 
> Processing MSG_REPORT_CLOSE: GenericMetaStore,139757491,1274779304880 from 
> sjc1c104.example.com,60020,1274747062601; 1 of 4
> 2010-05-25 11:01:05,729 DEBUG org.apache.hadoop.hbase.master.HMaster: 
> Processing todo: ProcessRegionClose of 
> GenericMetaStore,139757491,1274779304880, false, reassign: true
> 2010-05-25 11:01:05,730 INFO 
> org.apache.hadoop.hbase.master.ProcessRegionClose$1: region set as 
> unassigned: GenericMetaStore,139757491,1274779304880
> 2010-05-25 11:01:05,731 INFO org.apache.hadoop.hbase.master.RegionManager: 
> Assigning region GenericMetaStore,139757491,1274779304880 to 
> sjc1c091.example.com,60020,1274747056415
> 2010-05-25 11:01:05,751 INFO org.apache.hadoop.hbase.master.ServerManager: 
> Processing MSG_REPORT_OPEN: GenericMetaStore,139757491,1274779304880 from 
> sjc1c091.example.com,60020,1274747056415; 1 of 1
> 2010-05-25 11:01:05,752 INFO 
> org.apache.hadoop.hbase.master.RegionServerOperation: 
> GenericMetaStore,139757491,1274779304880 open on 10.209.42.238:60020
> 2010-05-25 11:01:05,752 INFO 
> org.apache.hadoop.hbase.master.RegionServerOperation: Updated row 
> GenericMetaStore,139757491,1274779304880 in region .META.,,1 with 
> startcode=1274747056415, server=10.209.42.238:60020
> 2010-05-25 11:01:05,775 DEBUG org.apache.hadoop.hbase.master.RegionManager: 
> Going to close region GenericMetaStore,139757491,1274779304880
> 2010-05-25 11:01:05,780 INFO org.apache.hadoop.hbase.master.ServerManager: 
> Processing MSG_REPORT_CLOSE: GenericMetaStore,139757491,1274779304880 from 
> sjc1c091.example.com,60020,1274747056415; 1 of 2
> 2010-05-25 11:01:05,780 DEBUG org.apache.hadoop.hbase.master.HMaster: 
> Processing todo: ProcessRegionClose of 
> GenericMetaStore,139757491,1274779304880, false, reassign: true
> 2010-05-25 11:01:05,780 INFO 
> org.apache.hadoop.hbase.master.ProcessRegionClose$1: region set as 
> unassigned: GenericMetaStore,139757491,1274779304880
> 2010-05-25 11:01:05,808 INFO org.apache.hadoop.hbase.master.RegionManager: 
> Assigning region GenericMetaStore,139757491,1274779304880 to 
> sjc1a003.example.com,60020,1274747057557
> 2010-05-25 11:01:05,828 INFO org.apache.hadoop.hbase.master.ServerManager: 
> Processing MSG_REPORT_OPEN: GenericMetaStore,139757491,1274779304880 from 
> sjc1a003.example.com,60020,1274747057557; 1 of 1
> 2010-05-25 11:01:05,828 INFO 
> org.apache.hadoop.hbase.master.RegionServerOperation: 
> GenericMetaStore,139757491,1274779304880 open on 10.209.32.148:60020
> 2010-05-25 11:01:05,829 INFO 
> org.apache.hadoop.hbase.master.RegionServerOperation: Updated row 
> GenericMetaStore,139757491,1274779304880 in region .META.,,1 with 
> startcode=1274747057557, server=10.209.32.148:60020
> {code}
> The culprit seems to be the code that wants to bring up underloaded 
> regionservers up to average.  That and something about the lightly loaded 
> servers math that is off.
> I'm marking this as a blocker.  I'm not sure why its not more common.  There 
> were some issues on this cluster regards disks filling but though such an 
> event may have provoked the issue, we should have evened out eventually.
> Making this a blocker on 0.20.5.   Need to fix it for this user at least.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-2617) Load balancer falls into pathological state if one server under average - slop; endless churn

Reply via email to