[
https://issues.apache.org/jira/browse/HBASE-21941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16773912#comment-16773912
]
Duo Zhang commented on HBASE-21941:
-----------------------------------
I think the intention here is that, scan.next is not idempotent, so retry on
timeout usually does not work, as the server will give you an
OutOfOrderException soon. So the rpc timeout is exactly the same with operation
timeout for scan.next.
And I do not fully understand the problem here. You mean we try to open a new
scanner when we hit an exception the first time, and the second time we just
give up without opening a new scanner again?
> Use operation timeout instead of scanner timeout for scan call
> --------------------------------------------------------------
>
> Key: HBASE-21941
> URL: https://issues.apache.org/jira/browse/HBASE-21941
> Project: HBase
> Issue Type: Bug
> Reporter: Guanghao Zhang
> Priority: Major
>
> There are hbase.rpc.timeout and hbase.client.operation.timeout for client
> operation expect scan. And there is a special config
> hbase.client.scanner.timeout.period for scan. If I am not wrong, this should
> rpc timeout of scan call. But now we use this as operation timeout of scan
> call. The scan callable is complicated as we need handle the replica case.
> The real call with retry is called in
> [https://github.com/apache/hbase/blob/9a55cbb2c1dfe5a13a6ceb323ac7edd23532f4b5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ResultBoundedCompletionService.java#L80|https://github.com/apache/hbase/blob/9a55cbb2c1dfe5a13a6ceb323ac7edd23532f4b5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ResultBoundedCompletionService.java#L80.]
> . And the callTimeout is configed by hbase.client.scanner.timeout.period. So
> I thought this is not right.
>
> I meet this problem when run ITBLL for branch-2.2. The verify map task failed
> when scan.
> {code:java}
> 2019-02-21 03:47:20,287 INFO [main]
> org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl: recovered from
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
> attempts=16, exceptions:
> 2019-02-21 03:47:20,287 INFO [main]
> org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl: Closing the
> previously opened scanner object.
> 2019-02-21 03:47:20,331 INFO [main]
> org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl: Current
> scan={"loadColumnFamiliesOnDemand":null,"startRow":"\\xE1\\x9B\\xB4\\xF0\\xB3(JT\\xDC\\x86pf|y\\xF3\\xE9","stopRow":"","batch":-1,"cacheBlocks":false,"totalColumns":3,"maxResultSize":4194304,"families":{"big":["big"],"meta":["prev"],"tiny":["tiny"]},"caching":10000,"maxVersions":1,"timeRange":[0,9223372036854775807]}
> 2019-02-21 03:47:20,335 INFO
> [hconnection-0x7b44b63d-metaLookup-shared--pool4-t36]
> org.apache.hadoop.hbase.client.ScannerCallable: Open
> scanner=-4916858472898750097 for
> scan={"loadColumnFamiliesOnDemand":null,"startRow":"IntegrationTestBigLinkedList,\\xE1\\x9B\\xB4\\xF0\\xB3(JT\\xDC\\x86pf|y\\xF3\\xE9,99999999999999","stopRow":"IntegrationTestBigLinkedList,,","batch":-1,"cacheBlocks":true,"totalColumns":1,"maxResultSize":-1,"families":{"info":["ALL"]},"caching":5,"maxVersions":1,"timeRange":[0,9223372036854775807]}
> on region region=hbase:meta,,1.1588230740,
> hostname=c4-hadoop-tst-st26.bj,29100,1550660298519, seqNum=-1
> 2019-02-21 03:48:20,354 INFO [main]
> org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl: Mapper took 60023ms
> to process 0 rows
> 2019-02-21 03:48:20,355 INFO [main]
> org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl:
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
> attempts=16, exceptions: Thu Feb 21 03:48:20 CST 2019, null,
> java.net.SocketTimeoutException: callTimeout=60000, callDuration=60215: Call
> to c4-hadoop-tst-st30.bj/10.132.2.41:29100 failed on local exception:
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=7102,
> waitTime=60006, rpcTimeout=60000 row 'ᛴ�(JT܆pf|y��' on table
> 'IntegrationTestBigLinkedList' at
> region=IntegrationTestBigLinkedList,\xDD\xDD\xDD\xDD\xDD\xDD\xDD\xDD,1550661322522.d5d29d2f1e8fee42d666c117709c3a46.,
> hostname=c4-hadoop-tst-st30.bj,29100,1550652984371, seqNum=1007960
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
> attempts=16, exceptions: Thu Feb 21 03:48:20 CST 2019, null,
> java.net.SocketTimeoutException: callTimeout=60000, callDuration=60215: Call
> to c4-hadoop-tst-st30.bj/10.132.2.41:29100 failed on local exception:
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=7102,
> waitTime=60006, rpcTimeout=60000 row 'ᛴ�(JT܆pf|y��' on table
> 'IntegrationTestBigLinkedList' at
> region=IntegrationTestBigLinkedList,\xDD\xDD\xDD\xDD\xDD\xDD\xDD\xDD,1550661322522.d5d29d2f1e8fee42d666c117709c3a46.,
> hostname=c4-hadoop-tst-st30.bj,29100,1550652984371, seqNum=1007960 at
> org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:299)
> at
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:242)
> at
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:58)
> at
> org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithoutRetries(RpcRetryingCallerImpl.java:192)
> at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:266)
> at
> org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:434)
> at
> org.apache.hadoop.hbase.client.ClientScanner.nextWithSyncCache(ClientScanner.java:309)
> at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:594)
> at
> org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue(TableRecordReaderImpl.java:237)
> {code}
>
> When TableRecordReaderImpl#nextKeyValue first scan faile, it recoverd by
> close the old scaner and open a new one. But the new scanner failed after 60
> seconds. Then it throw the exception and don't try to recovery and make the
> task failed.
>
> Or if the scanner timeout is a operation timeout of scan call, we should set
> the default scanner timeout to 1200000 which is same with the default
> operation timeout.
>
>
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)