[ 
https://issues.apache.org/jira/browse/HBASE-15219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146054#comment-15146054
 ] 

Sean Busbey commented on HBASE-15219:
-------------------------------------

My concern is that it's difficult to tell which failures are preexisting and 
which are caused by the patch. Flaky tests on trunk are unlikely to cause findbugs 
issues or Hadoop version incompatibilities.

Right now 1.2 is blue and I wouldn't want to jeopardize that as we move into a 
release vote with a short window.

https://builds.apache.org/view/H-L/view/HBase/job/HBase-1.2/lastCompletedBuild/testReport/

Do we have one or more tickets tracking the failures in trunk that you think 
are causing unrelated errors?

> Canary tool does not return non-zero exit code when one of regions is in 
> stuck state 
> -------------------------------------------------------------------------------------
>
>                 Key: HBASE-15219
>                 URL: https://issues.apache.org/jira/browse/HBASE-15219
>             Project: HBase
>          Issue Type: Improvement
>          Components: canary
>    Affects Versions: 0.98.16
>            Reporter: Vishal Khandelwal
>            Assignee: Ted Yu
>            Priority: Critical
>             Fix For: 2.0.0, 1.3.0, 1.2.1, 0.98.18
>
>         Attachments: HBASE-15219.v1.patch, HBASE-15219.v3.patch, 
> HBASE-15219.v4.patch, HBASE-15219.v5.patch, HBASE-15219.v7.patch, 
> HBASE-15219.v8.patch
>
>
> {code}
> 2016-02-05 12:24:18,571 ERROR [pool-2-thread-7] tool.Canary - read from region CAN_1,\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1454667477865.00e77d07b8defe10704417fb99aa0418. column family 0 failed
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=2, exceptions:
> Fri Feb 05 12:24:15 GMT 2016, org.apache.hadoop.hbase.client.RpcRetryingCaller@54c9fea0, org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region CAN_1,\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1454667477865.00e77d07b8defe10704417fb99aa0418. is not online on isthbase02-dnds1-3-crd.eng.sfdc.net,60020,1454669984738
>       at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2852)
>       at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:4468)
>       at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2984)
>       at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:31186)
>       at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2149)
>       at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:104)
>       at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
>       at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
>       at java.lang.Thread.run(Thread.java:745)
> --------
> -bash-4.1$ echo $?
> 0
> {code}
> The code below prints the error but does not set or return a non-zero exit 
> code. Because of this, the tool can't be integrated with Nagios or other 
> alerting systems. Ideally it should return an error code on failure, as per 
> the documentation:
> <snip>
> This tool will return non zero error codes to user for collaborating with 
> other monitoring tools, such as Nagios. The error code definitions are:
> private static final int USAGE_EXIT_CODE = 1;
> private static final int INIT_ERROR_EXIT_CODE = 2;
> private static final int TIMEOUT_ERROR_EXIT_CODE = 3;
> private static final int ERROR_EXIT_CODE = 4;
> </snip>
> {code}
> org.apache.hadoop.hbase.tool.Canary.RegionTask 
> public Void read() {
>       ....
>       try {
>         table = connection.getTable(region.getTable());
>         tableDesc = table.getTableDescriptor();
>       } catch (IOException e) {
>         LOG.debug("sniffRegion failed", e);
>         sink.publishReadFailure(region, e);
>        ...
>         return null;
>       }
> {code}
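The problem described above comes down to the sink seeing each read failure without anything remembering it when the run ends. A minimal, self-contained Java sketch of one way to close that gap follows: count failures atomically in the sink, then map the count to the documented exit codes. This is not the HBASE-15219 patch. `FailureCountingSink`, `exitCodeFor`, `runSimulation`, `SUCCESS_EXIT_CODE`, and the precedence order (init error, then timeout, then read failures) are all hypothetical, and the real sink receives the region object rather than a `String`.

```java
import java.io.IOException;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class CanaryExitCodeSketch {
    // Values from the documentation quoted above; SUCCESS_EXIT_CODE is a hypothetical name for 0.
    static final int SUCCESS_EXIT_CODE = 0;
    static final int INIT_ERROR_EXIT_CODE = 2;
    static final int TIMEOUT_ERROR_EXIT_CODE = 3;
    static final int ERROR_EXIT_CODE = 4;

    /** Hypothetical sink: logs the failure as before, but also counts it. */
    static class FailureCountingSink {
        private final AtomicLong readFailures = new AtomicLong();

        void publishReadFailure(String region, Exception e) {
            System.err.println("read from region " + region + " failed: " + e.getMessage());
            readFailures.incrementAndGet(); // atomic: region tasks run on a thread pool
        }

        long getReadFailureCount() {
            return readFailures.get();
        }
    }

    /** Hypothetical precedence: init error, then timeout, then read failures. */
    static int exitCodeFor(boolean initFailed, boolean timedOut, long readFailures) {
        if (initFailed) return INIT_ERROR_EXIT_CODE;
        if (timedOut) return TIMEOUT_ERROR_EXIT_CODE;
        return readFailures > 0 ? ERROR_EXIT_CODE : SUCCESS_EXIT_CODE;
    }

    /** Simulates one read task per region; regions listed in stuck fail their read. */
    static int runSimulation(int regionCount, Set<Integer> stuck) {
        FailureCountingSink sink = new FailureCountingSink();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < regionCount; i++) {
            final int id = i;
            pool.submit(() -> {
                if (stuck.contains(id)) {
                    // Like the catch block above: publish the failure and give up on this region.
                    sink.publishReadFailure("region-" + id, new IOException("region is not online"));
                }
            });
        }
        pool.shutdown();
        boolean finished = false;
        try {
            finished = pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // Tasks that did not finish in time are treated as a timeout.
        return exitCodeFor(false, !finished, sink.getReadFailureCount());
    }

    public static void main(String[] args) {
        int code = runSimulation(8, Set.of(3));
        System.out.println("exit code: " + code); // one stuck region -> 4
        // The real tool would finish with System.exit(code) so Nagios can see it.
    }
}
```

The counter lives in the sink because every read path in the snippet above already reports failures there, so no individual task has to change its return value.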



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
