[
https://issues.apache.org/jira/browse/HBASE-7520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13560343#comment-13560343
]
Sergey Shelukhin commented on HBASE-7520:
-----------------------------------------
There are 2 separate issues there (at least in the failure I am currently
looking at).
1) Retries are not smart. Gradual fallback doesn't make a lot of sense when we
are going to different servers. E.g. if we just had 8 retries from server FOO,
and now read that region is on BAR in meta, no sense to wait 30 seconds before
going to BAR. I'll file a JIRA.
2) There's some sort of bug in assignment that necessitates the timeoutmonitor
to kick in.
Here's the analysis.
Cruft removed from logs for convenience.
Note that "Sending ... " log is BEFORE the delay.
Servers are indicated by ports in the text.
Here the request starts. Some normal test errors we are supposed to recover
from happen.
Region is closing (being moved) on 51231, then we get RegionMoved after retry
to 51231, then we get error from destination 50661 as region is still opening.
{code}
2013-01-22 17:59:00,537 Retrying due to errors: Error #0 from
[10.11.2.92:51231] for
[6b03eab4f067cf8739e85bd408f05d94-27118]org.apache.hadoop.hbase.NotServingRegionException:
org.apache.hadoop.hbase.NotServingRegionException:
IntegrationTestRebalanceAndKillServersTargeted,66666660,1358906196709.0200b366bc37c5afd1185f7d487c7dfb.
is closing
2013-01-22 17:59:00,571 Sending requests to [10.11.2.92:51231] with delay of
[1001] for rows [6b03eab4f067cf8739e85bd408f05d94-27118;]
2013-01-22 17:59:01,577 Retrying due to errors: Error #0 from
[10.11.2.92:51231] for
[6b03eab4f067cf8739e85bd408f05d94-27118]java.util.concurrent.ExecutionException:
org.apache.hadoop.hbase.RegionMovedException: Region moved to:
hostname=10.11.2.92 port=50661. As of locationSeqNum=129492.
2013-01-22 17:59:01,577 Sending requests to [10.11.2.92:50661] with delay of
[1009] for rows [6b03eab4f067cf8739e85bd408f05d94-27118;]
2013-01-22 17:59:02,591 Retrying due to errors: Error #0 from
[10.11.2.92:50661] for
[6b03eab4f067cf8739e85bd408f05d94-27118]java.util.concurrent.ExecutionException:
org.apache.hadoop.hbase.NotServingRegionException:
org.apache.hadoop.hbase.NotServingRegionException: Region is not online:
0200b366bc37c5afd1185f7d487c7dfb
2013-01-22 17:59:02,591 Sending requests to [10.11.2.92:51231] with delay of
[2002] for rows [6b03eab4f067cf8739e85bd408f05d94-27118;]
{code}
ChaosMonkey kills the server which the region is moving to (50661).
We, meanwhile, cycle back to 51231 based on meta, get redirect again, and go to
50661 again. But it's now dead.
{code}
2013-01-22 17:59:03,121 INFO [Thread-506] util.ChaosMonkey$Action(179):
Killing region server:10.11.2.92,50661,1358906192942
2013-01-22 17:59:04,594 Retrying due to errors: Error #0 from
[10.11.2.92:51231] for
[6b03eab4f067cf8739e85bd408f05d94-27118]java.util.concurrent.ExecutionException:
org.apache.hadoop.hbase.RegionMovedException: Region moved to:
hostname=10.11.2.92 port=50661. As of locationSeqNum=129492.
2013-01-22 17:59:04,595 Sending requests to [10.11.2.92:50661] with delay of
[2014] for rows [6b03eab4f067cf8739e85bd408f05d94-27118;]
2013-01-22 17:59:06,609 Retrying due to errors: Error #0 from
[10.11.2.92:50661] for
[6b03eab4f067cf8739e85bd408f05d94-27118]java.util.concurrent.ExecutionException:
org.apache.hadoop.hbase.ipc.HBaseClient$FailedServerException: This server is
in the failed servers list: /10.11.2.92:50661
2013-01-22 17:59:06,616 Sending requests to [10.11.2.92:50661] with delay of
[4004] for rows [6b03eab4f067cf8739e85bd408f05d94-27118;]
{code}
At this point we keep retrying against 50661, which is in META.
{code}
2013-01-22 17:59:08,631 Current INFO from scan results = {NAME =>
'IntegrationTestRebalanceAndKillServersTargeted,66666660,1358906196709.0200b366bc37c5afd1185f7d487c7dfb.',
STARTKEY => '66666660', ENDKEY => '7333332c',
ENCODED => 0200b366bc37c5afd1185f7d487c7dfb,}
2013-01-22 17:59:08,631 Cached location for
IntegrationTestRebalanceAndKillServersTargeted,66666660,1358906196709.0200b366bc37c5afd1185f7d487c7dfb.
is 10.11.2.92:50661
2013-01-22 17:59:10,621 Retrying due to errors: Error #0 from
[10.11.2.92:50661] for
[6b03eab4f067cf8739e85bd408f05d94-27118]java.util.concurrent.ExecutionException:
org.apache.hadoop.hbase.ipc.HBaseClient$FailedServerException: This server is
in the failed servers list: /10.11.2.92:50661
2013-01-22 17:59:10,626 Sending requests to [10.11.2.92:50661] with delay of
[4025] for rows [6b03eab4f067cf8739e85bd408f05d94-27118;]
2013-01-22 17:59:14,653 Retrying due to errors: Error #0 from
[10.11.2.92:50661] for
[6b03eab4f067cf8739e85bd408f05d94-27118]java.util.concurrent.ExecutionException:
org.apache.hadoop.hbase.ipc.HBaseClient$FailedServerException: This server is
in the failed servers list: /10.11.2.92:50661
2013-01-22 17:59:14,662 Sending requests to [10.11.2.92:50661] with delay of
[8026] for rows [6b03eab4f067cf8739e85bd408f05d94-27118;]
2013-01-22 17:59:22,689 Retrying due to errors: Error #0 from
[10.11.2.92:50661] for
[6b03eab4f067cf8739e85bd408f05d94-27118]java.util.concurrent.ExecutionException:
org.apache.hadoop.hbase.ipc.HBaseClient$FailedServerException: This server is
in the failed servers list: /10.11.2.92:50661
2013-01-22 17:59:22,699 Sending requests to [10.11.2.92:50661] with delay of
[16151] for rows [6b03eab4f067cf8739e85bd408f05d94-27118;]
2013-01-22 17:59:28,748 Current INFO from scan results = {NAME =>
'IntegrationTestRebalanceAndKillServersTargeted,66666660,1358906196709.0200b366bc37c5afd1185f7d487c7dfb.',
STARTKEY => '66666660', ENDKEY => '7333332c',
ENCODED => 0200b366bc37c5afd1185f7d487c7dfb,}
2013-01-22 17:59:28,748 Cached location for
IntegrationTestRebalanceAndKillServersTargeted,66666660,1358906196709.0200b366bc37c5afd1185f7d487c7dfb.
is 10.11.2.92:50661
{code}
Master thinks assignment timed out, despite region being in meta!
{code}
2013-01-22 17:59:34,011 INFO [10.11.2.92,50658,1358906192673.timeoutMonitor]
master.AssignmentManager$TimeoutMonitor(2560): Regions in transition timed out:
{IntegrationTestRebalanceAndKillServersTargeted,66666660,1358906196709.0200b366bc37c5afd1185f7d487c7dfb.
state=OPENING, ts=0, server=10.11.2.92,50661,1358906192942}
{code}
Region is opened on 52058 thanks to timeout monitor, and we finally get correct
location.
However, by this time the combination of the above delay and growing request
delay exceed ChaosMonkey period, so it moves the region again and we finally
run out of retries.
{code}
2013-01-22 17:59:38,852 Retrying due to errors: Error #0 from
[10.11.2.92:50661] for
[6b03eab4f067cf8739e85bd408f05d94-27118]java.util.concurrent.ExecutionException:
java.net.ConnectException: Connection refused
2013-01-22 17:59:38,860 Sending requests to [10.11.2.92:52068] with delay of
[32199] for rows [6b03eab4f067cf8739e85bd408f05d94-27118;]
2013-01-22 18:00:05,481 DEBUG [Thread-506] util.ChaosMonkey$Action(203):
Removing 2 regions from 10.11.2.92,52068,1358906350270
2013-01-22 18:00:11,070 Failed to insert: 27118; region information: cached:
region=IntegrationTestRebalanceAndKillServersTargeted,66666660,1358906196709.0200b366bc37c5afd1185f7d487c7dfb.,
hostname=10.11.2.92, port=51015, seqNum=148999; cache is up to date; errors:
Error #0 from [10.11.2.92:52068] for
[6b03eab4f067cf8739e85bd408f05d94-27118]java.util.concurrent.ExecutionException:
org.apache.hadoop.hbase.RegionMovedException: Region moved to:
hostname=10.11.2.92 port=51015. As of locationSeqNum=148999.
{code}
> org.apache.hadoop.hbase.IntegrationTestRebalanceAndKillServersTargeted fails
> when I cd hbase-it and mvn verify
> --------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-7520
> URL: https://issues.apache.org/jira/browse/HBASE-7520
> Project: HBase
> Issue Type: Bug
> Components: test
> Environment: macosx trunk
> Reporter: stack
> Assignee: Sergey Shelukhin
> Priority: Critical
>
> Trying to make up something to hand off to bigtop project, running the hbase
> it tests, this one fails.
> {code}
> durruti:failsafe-reports stack$ more
> org.apache.hadoop.hbase.IntegrationTestRebalanceAndKillServersTargeted.txt
> -------------------------------------------------------------------------------
> Test set:
> org.apache.hadoop.hbase.IntegrationTestRebalanceAndKillServersTargeted
> -------------------------------------------------------------------------------
> Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 206.538 sec
> <<< FAILURE!
> testDataIngest(org.apache.hadoop.hbase.IntegrationTestRebalanceAndKillServersTargeted)
> Time elapsed: 206.395 sec <<< FAILURE!
> junit.framework.AssertionFailedError: Load failed with error code 1
> at junit.framework.Assert.fail(Assert.java:50)
> at
> org.apache.hadoop.hbase.IngestIntegrationTestBase.runIngestTest(IngestIntegrationTestBase.java:98)
> at
> org.apache.hadoop.hbase.IntegrationTestRebalanceAndKillServersTargeted.testDataIngest(IntegrationTestRebalanceAndKillServersTargeted.java:121)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> ...
> {code}
> org.apache.hadoop.hbase.IntegrationTestRebalanceAndKillServersTargeted-output.txt
> has nothing in it.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira