[ 
https://issues.apache.org/jira/browse/HBASE-7520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13560343#comment-13560343
 ] 

Sergey Shelukhin commented on HBASE-7520:
-----------------------------------------

There are 2 separate issues there (at least in the failure I am currently 
looking at).
1) Retries are not smart. Gradual fallback doesn't make a lot of sense when we 
are going to different servers. E.g. if we just had 8 retries from server FOO, 
and now read that region is on BAR in meta, no sense to wait 30 seconds before 
going to BAR. I'll file a JIRA.
2) There's some sort of bug in assignment that necessitates the timeoutmonitor 
to kick in.
Here's the analysis.

Cruft removed from logs for convenience.
Note that "Sending ... " log is BEFORE the delay.
Servers are indicated by ports in the text.

Here the request starts. Some normal test errors we are supposed to recover 
from happen.
Region is closing (being moved) on 51231, then we get RegionMoved after retry 
to 51231, then we get error from destination 50661 as region is still opening.
{code}
2013-01-22 17:59:00,537  Retrying due to errors: Error #0 from 
[10.11.2.92:51231] for 
[6b03eab4f067cf8739e85bd408f05d94-27118]org.apache.hadoop.hbase.NotServingRegionException:
 org.apache.hadoop.hbase.NotServingRegionException: 
IntegrationTestRebalanceAndKillServersTargeted,66666660,1358906196709.0200b366bc37c5afd1185f7d487c7dfb.
 is closing
2013-01-22 17:59:00,571  Sending requests to [10.11.2.92:51231] with delay of 
[1001] for rows [6b03eab4f067cf8739e85bd408f05d94-27118;]
2013-01-22 17:59:01,577  Retrying due to errors: Error #0 from 
[10.11.2.92:51231] for 
[6b03eab4f067cf8739e85bd408f05d94-27118]java.util.concurrent.ExecutionException:
 org.apache.hadoop.hbase.RegionMovedException: Region moved to: 
hostname=10.11.2.92 port=50661. As of locationSeqNum=129492.
2013-01-22 17:59:01,577  Sending requests to [10.11.2.92:50661] with delay of 
[1009] for rows [6b03eab4f067cf8739e85bd408f05d94-27118;]
2013-01-22 17:59:02,591  Retrying due to errors: Error #0 from 
[10.11.2.92:50661] for 
[6b03eab4f067cf8739e85bd408f05d94-27118]java.util.concurrent.ExecutionException:
 org.apache.hadoop.hbase.NotServingRegionException: 
org.apache.hadoop.hbase.NotServingRegionException: Region is not online: 
0200b366bc37c5afd1185f7d487c7dfb
2013-01-22 17:59:02,591  Sending requests to [10.11.2.92:51231] with delay of 
[2002] for rows [6b03eab4f067cf8739e85bd408f05d94-27118;]
{code}

ChaosMonkey kills the server which the region is moving to (50661). 
We, meanwhile, cycle back to 51231 based on meta, get redirect again, and go to 
50661 again. But it's now dead.
{code}
2013-01-22 17:59:03,121 INFO  [Thread-506] util.ChaosMonkey$Action(179): 
Killing region server:10.11.2.92,50661,1358906192942
2013-01-22 17:59:04,594  Retrying due to errors: Error #0 from 
[10.11.2.92:51231] for 
[6b03eab4f067cf8739e85bd408f05d94-27118]java.util.concurrent.ExecutionException:
 org.apache.hadoop.hbase.RegionMovedException: Region moved to: 
hostname=10.11.2.92 port=50661. As of locationSeqNum=129492.
2013-01-22 17:59:04,595  Sending requests to [10.11.2.92:50661] with delay of 
[2014] for rows [6b03eab4f067cf8739e85bd408f05d94-27118;]
2013-01-22 17:59:06,609  Retrying due to errors: Error #0 from 
[10.11.2.92:50661] for 
[6b03eab4f067cf8739e85bd408f05d94-27118]java.util.concurrent.ExecutionException:
 org.apache.hadoop.hbase.ipc.HBaseClient$FailedServerException: This server is 
in the failed servers list: /10.11.2.92:50661
2013-01-22 17:59:06,616  Sending requests to [10.11.2.92:50661] with delay of 
[4004] for rows [6b03eab4f067cf8739e85bd408f05d94-27118;]
{code}

At this point we keep retrying against 50661, which is in META.
{code}
2013-01-22 17:59:08,631 Current INFO from scan results = {NAME => 
'IntegrationTestRebalanceAndKillServersTargeted,66666660,1358906196709.0200b366bc37c5afd1185f7d487c7dfb.',
 STARTKEY => '66666660', ENDKEY => '7333332c', 
ENCODED => 0200b366bc37c5afd1185f7d487c7dfb,}
2013-01-22 17:59:08,631 Cached location for 
IntegrationTestRebalanceAndKillServersTargeted,66666660,1358906196709.0200b366bc37c5afd1185f7d487c7dfb.
 is 10.11.2.92:50661
2013-01-22 17:59:10,621  Retrying due to errors: Error #0 from 
[10.11.2.92:50661] for 
[6b03eab4f067cf8739e85bd408f05d94-27118]java.util.concurrent.ExecutionException:
 org.apache.hadoop.hbase.ipc.HBaseClient$FailedServerException: This server is 
in the failed servers list: /10.11.2.92:50661
2013-01-22 17:59:10,626  Sending requests to [10.11.2.92:50661] with delay of 
[4025] for rows [6b03eab4f067cf8739e85bd408f05d94-27118;]
2013-01-22 17:59:14,653  Retrying due to errors: Error #0 from 
[10.11.2.92:50661] for 
[6b03eab4f067cf8739e85bd408f05d94-27118]java.util.concurrent.ExecutionException:
 org.apache.hadoop.hbase.ipc.HBaseClient$FailedServerException: This server is 
in the failed servers list: /10.11.2.92:50661
2013-01-22 17:59:14,662  Sending requests to [10.11.2.92:50661] with delay of 
[8026] for rows [6b03eab4f067cf8739e85bd408f05d94-27118;]
2013-01-22 17:59:22,689  Retrying due to errors: Error #0 from 
[10.11.2.92:50661] for 
[6b03eab4f067cf8739e85bd408f05d94-27118]java.util.concurrent.ExecutionException:
 org.apache.hadoop.hbase.ipc.HBaseClient$FailedServerException: This server is 
in the failed servers list: /10.11.2.92:50661
2013-01-22 17:59:22,699  Sending requests to [10.11.2.92:50661] with delay of 
[16151] for rows [6b03eab4f067cf8739e85bd408f05d94-27118;]
2013-01-22 17:59:28,748 Current INFO from scan results = {NAME => 
'IntegrationTestRebalanceAndKillServersTargeted,66666660,1358906196709.0200b366bc37c5afd1185f7d487c7dfb.',
 STARTKEY => '66666660', ENDKEY => '7333332c', 
ENCODED => 0200b366bc37c5afd1185f7d487c7dfb,}
2013-01-22 17:59:28,748 Cached location for 
IntegrationTestRebalanceAndKillServersTargeted,66666660,1358906196709.0200b366bc37c5afd1185f7d487c7dfb.
 is 10.11.2.92:50661
{code}

Master thinks assignment timed out, despite region being in meta!
{code}
2013-01-22 17:59:34,011 INFO  [10.11.2.92,50658,1358906192673.timeoutMonitor] 
master.AssignmentManager$TimeoutMonitor(2560): Regions in transition timed out: 
 
{IntegrationTestRebalanceAndKillServersTargeted,66666660,1358906196709.0200b366bc37c5afd1185f7d487c7dfb.
 state=OPENING, ts=0, server=10.11.2.92,50661,1358906192942}
{code}

Region is opened on 52058 thanks to timeout monitor, and we finally get correct 
location. 
However, by this time the combination of the above delay and growing request 
delay exceed ChaosMonkey period, so it moves the region again and we finally 
run out of retries.
{code}
2013-01-22 17:59:38,852  Retrying due to errors: Error #0 from 
[10.11.2.92:50661] for 
[6b03eab4f067cf8739e85bd408f05d94-27118]java.util.concurrent.ExecutionException:
 java.net.ConnectException: Connection refused
2013-01-22 17:59:38,860  Sending requests to [10.11.2.92:52068] with delay of 
[32199] for rows [6b03eab4f067cf8739e85bd408f05d94-27118;]
2013-01-22 18:00:05,481 DEBUG [Thread-506] util.ChaosMonkey$Action(203): 
Removing 2 regions from 10.11.2.92,52068,1358906350270
2013-01-22 18:00:11,070  Failed to insert: 27118; region information: cached: 
region=IntegrationTestRebalanceAndKillServersTargeted,66666660,1358906196709.0200b366bc37c5afd1185f7d487c7dfb.,
 hostname=10.11.2.92, port=51015, seqNum=148999; cache is up to date; errors: 
Error #0 from [10.11.2.92:52068] for 
[6b03eab4f067cf8739e85bd408f05d94-27118]java.util.concurrent.ExecutionException:
 org.apache.hadoop.hbase.RegionMovedException: Region moved to: 
hostname=10.11.2.92 port=51015. As of locationSeqNum=148999.
{code}


                
> org.apache.hadoop.hbase.IntegrationTestRebalanceAndKillServersTargeted fails 
> when I cd hbase-it and mvn verify
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-7520
>                 URL: https://issues.apache.org/jira/browse/HBASE-7520
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>         Environment: macosx trunk
>            Reporter: stack
>            Assignee: Sergey Shelukhin
>            Priority: Critical
>
> Trying to make up something to hand off to bigtop project, running the hbase 
> it tests, this one fails.
> {code}
> durruti:failsafe-reports stack$ more 
> org.apache.hadoop.hbase.IntegrationTestRebalanceAndKillServersTargeted.txt 
> -------------------------------------------------------------------------------
> Test set: 
> org.apache.hadoop.hbase.IntegrationTestRebalanceAndKillServersTargeted
> -------------------------------------------------------------------------------
> Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 206.538 sec 
> <<< FAILURE!
> testDataIngest(org.apache.hadoop.hbase.IntegrationTestRebalanceAndKillServersTargeted)
>   Time elapsed: 206.395 sec  <<< FAILURE!
> junit.framework.AssertionFailedError: Load failed with error code 1
>         at junit.framework.Assert.fail(Assert.java:50)
>         at 
> org.apache.hadoop.hbase.IngestIntegrationTestBase.runIngestTest(IngestIntegrationTestBase.java:98)
>         at 
> org.apache.hadoop.hbase.IntegrationTestRebalanceAndKillServersTargeted.testDataIngest(IntegrationTestRebalanceAndKillServersTargeted.java:121)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> ...
> {code}
> org.apache.hadoop.hbase.IntegrationTestRebalanceAndKillServersTargeted-output.txt
>   has nothing in it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to