[ https://issues.apache.org/jira/browse/HBASE-21941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Guanghao Zhang updated HBASE-21941:
-----------------------------------
    Description: 
There are hbase.rpc.timeout and hbase.client.operation.timeout for every client 
operation except scan; scan has its own config, 
hbase.client.scanner.timeout.period. If I am not wrong, this should be the rpc 
timeout of a scan call, but right now we use it as the operation timeout of the 
scan call. The scan callable is complicated because we need to handle the 
replica case. The real call with retry is made in 
[https://github.com/apache/hbase/blob/9a55cbb2c1dfe5a13a6ceb323ac7edd23532f4b5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ResultBoundedCompletionService.java#L80], 
and its callTimeout is configured by hbase.client.scanner.timeout.period. So I 
think this is not right.
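
A minimal sketch of the three client-side knobs, with illustrative values only (the config keys are the ones discussed above; the actual defaults in a deployment may differ):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ScanTimeoutKnobs {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Per-RPC timeout used by non-scan operations (get/put/delete).
    conf.setInt("hbase.rpc.timeout", 60000);
    // Overall timeout for a non-scan operation, across all retries.
    conf.setInt("hbase.client.operation.timeout", 1200000);
    // Scan-specific timeout; today it ends up as the callTimeout of each
    // scan RPC submitted through ResultBoundedCompletionService.
    conf.setInt("hbase.client.scanner.timeout.period", 60000);
    System.out.println(conf.get("hbase.client.scanner.timeout.period"));
  }
}
{code}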

 

I hit this problem when running ITBLL on branch-2.2: the verify map task failed 
during the scan.
{code:java}
2019-02-21 03:47:20,287 INFO [main] org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl: recovered from org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=16, exceptions:
2019-02-21 03:47:20,287 INFO [main] org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl: Closing the previously opened scanner object.
2019-02-21 03:47:20,331 INFO [main] org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl: Current scan={"loadColumnFamiliesOnDemand":null,"startRow":"\\xE1\\x9B\\xB4\\xF0\\xB3(JT\\xDC\\x86pf|y\\xF3\\xE9","stopRow":"","batch":-1,"cacheBlocks":false,"totalColumns":3,"maxResultSize":4194304,"families":{"big":["big"],"meta":["prev"],"tiny":["tiny"]},"caching":10000,"maxVersions":1,"timeRange":[0,9223372036854775807]}
2019-02-21 03:47:20,335 INFO [hconnection-0x7b44b63d-metaLookup-shared--pool4-t36] org.apache.hadoop.hbase.client.ScannerCallable: Open scanner=-4916858472898750097 for scan={"loadColumnFamiliesOnDemand":null,"startRow":"IntegrationTestBigLinkedList,\\xE1\\x9B\\xB4\\xF0\\xB3(JT\\xDC\\x86pf|y\\xF3\\xE9,99999999999999","stopRow":"IntegrationTestBigLinkedList,,","batch":-1,"cacheBlocks":true,"totalColumns":1,"maxResultSize":-1,"families":{"info":["ALL"]},"caching":5,"maxVersions":1,"timeRange":[0,9223372036854775807]} on region region=hbase:meta,,1.1588230740, hostname=c4-hadoop-tst-st26.bj,29100,1550660298519, seqNum=-1
2019-02-21 03:48:20,354 INFO [main] org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl: Mapper took 60023ms to process 0 rows
2019-02-21 03:48:20,355 INFO [main] org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=16, exceptions: Thu Feb 21 03:48:20 CST 2019, null, java.net.SocketTimeoutException: callTimeout=60000, callDuration=60215: Call to c4-hadoop-tst-st30.bj/10.132.2.41:29100 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=7102, waitTime=60006, rpcTimeout=60000 row 'ᛴ�(JT܆pf|y��' on table 'IntegrationTestBigLinkedList' at region=IntegrationTestBigLinkedList,\xDD\xDD\xDD\xDD\xDD\xDD\xDD\xDD,1550661322522.d5d29d2f1e8fee42d666c117709c3a46., hostname=c4-hadoop-tst-st30.bj,29100,1550652984371, seqNum=1007960
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=16, exceptions: Thu Feb 21 03:48:20 CST 2019, null, java.net.SocketTimeoutException: callTimeout=60000, callDuration=60215: Call to c4-hadoop-tst-st30.bj/10.132.2.41:29100 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=7102, waitTime=60006, rpcTimeout=60000 row 'ᛴ�(JT܆pf|y��' on table 'IntegrationTestBigLinkedList' at region=IntegrationTestBigLinkedList,\xDD\xDD\xDD\xDD\xDD\xDD\xDD\xDD,1550661322522.d5d29d2f1e8fee42d666c117709c3a46., hostname=c4-hadoop-tst-st30.bj,29100,1550652984371, seqNum=1007960
    at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:299)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:242)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:58)
    at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithoutRetries(RpcRetryingCallerImpl.java:192)
    at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:266)
    at org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:434)
    at org.apache.hadoop.hbase.client.ClientScanner.nextWithSyncCache(ClientScanner.java:309)
    at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:594)
    at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue(TableRecordReaderImpl.java:237)
{code}
 

When the first scan in TableRecordReaderImpl#nextKeyValue failed, it recovered 
by closing the old scanner and opening a new one. But the new scanner also 
failed after 60 seconds. It then threw the exception without trying to recover 
again, which made the task fail.
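
Below is a simplified sketch of that recovery pattern (illustrative only, not the actual TableRecordReaderImpl code): a failed next() is retried once on a fresh scanner, and a second timeout propagates and fails the map task.
{code:java}
import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

// Hypothetical helper that only illustrates the single-recovery behaviour
// described above.
class ScanRecoverySketch {
  private final Table table;
  private final Scan scan;
  private ResultScanner scanner;

  ScanRecoverySketch(Table table, Scan scan) throws IOException {
    this.table = table;
    this.scan = scan;
    this.scanner = table.getScanner(scan);
  }

  Result nextWithOneRecovery() throws IOException {
    try {
      return scanner.next();             // first attempt
    } catch (IOException e) {
      scanner.close();                   // "Closing the previously opened scanner object."
      scanner = table.getScanner(scan);  // reopen and retry once
      return scanner.next();             // a second timeout here is rethrown -> task fails
    }
  }
}
{code}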

 

Alternatively, if the scanner timeout really is meant to be the operation 
timeout of a scan call, we should raise its default to 1200000 ms, the same as 
the default operation timeout.
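
For example, a client could already align the two values today (a sketch only; 1200000 ms is the default operation timeout mentioned above):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class AlignScanTimeout {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Give the scan the same overall budget as other client operations.
    conf.setInt("hbase.client.scanner.timeout.period", 1200000);
    conf.setInt("hbase.client.operation.timeout", 1200000);
  }
}
{code}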

 

 

 

> Use operation timeout instead of scanner timeout for scan call
> --------------------------------------------------------------
>
>                 Key: HBASE-21941
>                 URL: https://issues.apache.org/jira/browse/HBASE-21941
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Guanghao Zhang
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
