[
https://issues.apache.org/jira/browse/HBASE-15219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15145250#comment-15145250
]
Ted Yu commented on HBASE-15219:
--------------------------------
I put the following on top of patch v7:
{code}
+ public boolean finalCheckForErrors() {
+ LOG.info("err " + errorCode + " read: " + sink.getReadFailureCount());
{code}
I used same procedure as Maddineni did.
Toward the tail of output, I saw:
{code}
Caused by: java.net.SocketTimeoutException: callTimeout=180000,
callDuration=189240: row '' on table 'tscantbl' at
region=tscantbl,,1453941714280.0f2f1a2fdfa3dad009807fb1b95d3c9a.,
hostname=ted-hbase-insec-4.novalocal,16020,1450214717066, seqNum=2
at
org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:159)
at
org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:64)
... 3 more
Caused by: org.apache.hadoop.hbase.NotServingRegionException:
org.apache.hadoop.hbase.NotServingRegionException: Region
tscantbl,,1453941714280.0f2f1a2fdfa3dad009807fb1b95d3c9a. is not online on
ted-hbase-insec-4.novalocal,16020,1450214717066
at
org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2898)
at
org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:947)
at
org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2235)
at
org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32205)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2114)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101)
at
org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
at java.lang.Thread.run(Thread.java:745)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at
org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
at
org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
at
org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:325)
at
org.apache.hadoop.hbase.client.ScannerCallable.openScanner(ScannerCallable.java:380)
at
org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:199)
at
org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:62)
at
org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
at
org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:346)
at
org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:320)
at
org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:126)
... 4 more
Caused by:
org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.NotServingRegionException):
org.apache.hadoop.hbase.NotServingRegionException: Region
tscantbl,,1453941714280.0f2f1a2fdfa3dad009807fb1b95d3c9a. is not online on
ted-hbase-insec-4.novalocal,16020,1450214717066
at
org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2898)
at
org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:947)
at
org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2235)
at
org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32205)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2114)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101)
at
org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
at java.lang.Thread.run(Thread.java:745)
at
org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1226)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
at
org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:32651)
at
org.apache.hadoop.hbase.client.ScannerCallable.openScanner(ScannerCallable.java:372)
... 10 more
2016-02-12 20:31:47,401 INFO [main] tool.Canary: err 0 read: 1
{code}
Please note the number of read failures (I offlined 1 region).
I am fairly confident my patch did what we agreed on.
> Canary tool does not return non-zero exit code when one of regions is in
> stuck state
> -------------------------------------------------------------------------------------
>
> Key: HBASE-15219
> URL: https://issues.apache.org/jira/browse/HBASE-15219
> Project: HBase
> Issue Type: Bug
> Components: canary
> Affects Versions: 0.98.16
> Reporter: Vishal Khandelwal
> Assignee: Ted Yu
> Priority: Critical
> Fix For: 2.0.0, 1.3.0, 1.2.1, 0.98.18
>
> Attachments: HBASE-15219.v1.patch, HBASE-15219.v3.patch,
> HBASE-15219.v4.patch, HBASE-15219.v5.patch, HBASE-15219.v7.patch
>
>
> {code}
> 2016-02-05 12:24:18,571 ERROR [pool-2-thread-7] tool.Canary - read from
> region
> CAN_1,\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1454667477865.00e77d07b8defe10704417fb99aa0418.
> column family 0 failed
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
> attempts=2, exceptions:
> Fri Feb 05 12:24:15 GMT 2016,
> org.apache.hadoop.hbase.client.RpcRetryingCaller@54c9fea0,
> org.apache.hadoop.hbase.NotServingRegionException:
> org.apache.hadoop.hbase.NotServingRegionException: Region
> CAN_1,\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1454667477865.00e77d07b8defe10704417fb99aa0418.
> is not online on isthbase02-dnds1-3-crd.eng.sfdc.net,60020,1454669984738
> at
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2852)
> at
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:4468)
> at
> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2984)
> at
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:31186)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2149)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:104)
> at
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
> at java.lang.Thread.run(Thread.java:745)
> --------
> -bash-4.1$ echo $?
> 0
> {code}
> Below code prints the error but it does sets/returns the exit code. Due to
> this tool can't be integrated with nagios or other alerting.
> Ideally it should return error for failures. as pre the documentation:
> <snip>
> This tool will return non zero error codes to user for collaborating with
> other monitoring tools, such as Nagios. The error code definitions are:
> private static final int USAGE_EXIT_CODE = 1;
> private static final int INIT_ERROR_EXIT_CODE = 2;
> private static final int TIMEOUT_ERROR_EXIT_CODE = 3;
> private static final int ERROR_EXIT_CODE = 4;
> </snip>
> {code}
> org.apache.hadoop.hbase.tool.Canary.RegionTask
> public Void read() {
> ....
> try {
> table = connection.getTable(region.getTable());
> tableDesc = table.getTableDescriptor();
> } catch (IOException e) {
> LOG.debug("sniffRegion failed", e);
> sink.publishReadFailure(region, e);
> ...
> return null;
> }
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)