[ 
https://issues.apache.org/jira/browse/HBASE-16604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524489#comment-15524489
 ] 

Enis Soztutar commented on HBASE-16604:
---------------------------------------

bq. Now the client on seeing this exception will try to retry this Exception. 
Since the scanner is removed from the scanner's map and already we have a 
scannerId associated with the scan request,
No, both UnknownScannerException and ScannerResetException extend 
DoNotRetryIOException, client will not retry with the same scanner id. This 
means that the RPC retrying mechanism (RPCRetryingCaller, 
ScannerCallableWithReplicas, etc) is not gonna be retried. However, at a higher 
level, there is a retry-from-where-you-are-left mechanism within ClientScanner. 
Thus, ClientScanner will re-open a new RegionScanner by sending a new scan 
request and get a new scanner name. This logic is in ClientScanner: 
{code}
 // If exception is any but the list below throw it back to the client; else 
setup
        // the scanner and retry.
        Throwable cause = e.getCause();
        if ((cause != null && cause instanceof NotServingRegionException) ||
            (cause != null && cause instanceof RegionServerStoppedException) ||
            e instanceof OutOfOrderScannerNextException ||
            e instanceof UnknownScannerException ||
            e instanceof ScannerResetException) {
          // Pass. It is easier writing the if loop test as list of what is 
allowed rather than
          // as a list of what is not allowed... so if in here, it means we do 
not throw.
        } else {
          throw e;
        }
{code}
The client will also toss-away any partial results so far, and continue the 
scan from the last known row. 

bq. ->In case of actual retries whether the scanner internals and its heap are 
reset properly
The heap will be reset correctly, because the region scanner is closed for 
good. A completely new RegionScanner will be constructed from scratch.
bq. -> In case my retries are over how am I cleaning up the heap and also the 
blocks. This will happen only for master branch I think and we need to fix only 
in 2.0.
We close the scanner and remove the lease already. We set the rpcCallback which 
will get run and call shipped(), no? Is it the case that if the scanner is 
already closed, shipped() will not free up the blocks? 

bq. One more thing is that since closeScanner is getting called even on 
exception the CP hooks preScannerClose and postScannerClose are getting called. 
Is that expected?
Yes, I have checked that in other contexts where we close the scanner in case 
of exception, we still call the coprocessor methods. 



> Scanner retries on IOException can cause the scans to miss data 
> ----------------------------------------------------------------
>
>                 Key: HBASE-16604
>                 URL: https://issues.apache.org/jira/browse/HBASE-16604
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver, Scanners
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>             Fix For: 2.0.0, 1.3.0, 1.4.0, 1.1.7, 1.2.4
>
>         Attachments: HBASE-16604-branch-1.3-addendum.patch, 
> hbase-16604_v1.patch, hbase-16604_v2.patch, hbase-16604_v3.branch-1.patch, 
> hbase-16604_v3.patch
>
>
> Debugging an ITBLL failure, where the Verify did not "see" all the data in 
> the cluster, I've noticed that if we end up getting a generic IOException 
> from the HFileReader level, we may end up missing the rest of the data in the 
> region. I was able to manually test this, and this stack trace helps to 
> understand what is going on: 
> {code}
> 2016-09-09 16:27:15,633 INFO  [hconnection-0x71ad3d8a-shared--pool21-t9] 
> client.ScannerCallable(376): Open scanner=1 for 
> scan={"loadColumnFamiliesOnDemand":null,"startRow":"","stopRow":"","batch":-1,"cacheBlocks":true,"totalColumns":1,"maxResultSize":2097152,"families":{"testFamily":["testFamily"]},"caching":100,"maxVersions":1,"timeRange":[0,9223372036854775807]}
>  on region 
> region=testScanThrowsException,,1473463632707.b2adfb618e5d0fe225c1dc40c0eabfee.,
>  hostname=hw10676,51833,1473463626529, seqNum=2
> 2016-09-09 16:27:15,634 INFO  
> [B.fifo.QRpcServer.handler=5,queue=0,port=51833] 
> regionserver.RSRpcServices(2196): scan request:scanner_id: 1 number_of_rows: 
> 100 close_scanner: false next_call_seq: 0 client_handles_partials: true 
> client_handles_heartbeats: true renew: false
> 2016-09-09 16:27:15,635 INFO  
> [B.fifo.QRpcServer.handler=5,queue=0,port=51833] 
> regionserver.RSRpcServices(2510): Rolling back next call seqId
> 2016-09-09 16:27:15,635 INFO  
> [B.fifo.QRpcServer.handler=5,queue=0,port=51833] 
> regionserver.RSRpcServices(2565): Throwing new 
> ServiceExceptionjava.io.IOException: Could not reseek 
> StoreFileScanner[HFileScanner for reader 
> reader=hdfs://localhost:51795/user/enis/test-data/d6fb1c70-93c1-4099-acb7-5723fc05a737/data/default/testScanThrowsException/b2adfb618e5d0fe225c1dc40c0eabfee/testFamily/5a213cc23b714e5e8e1a140ebbe72f2c,
>  compression=none, cacheConf=blockCache=LruBlockCache{blockCount=0, 
> currentSize=1567264, freeSize=1525578848, maxSize=1527146112, 
> heapSize=1567264, minSize=1450788736, minFactor=0.95, multiSize=725394368, 
> multiFactor=0.5, singleSize=362697184, singleFactor=0.25}, 
> cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, 
> cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, 
> prefetchOnOpen=false, firstKey=aaa/testFamily:testFamily/1473463633859/Put, 
> lastKey=zzz/testFamily:testFamily/1473463634271/Put, avgKeyLen=35, 
> avgValueLen=3, entries=17576, length=866998, 
> cur=/testFamily:/OLDEST_TIMESTAMP/Minimum/vlen=0/seqid=0] to key 
> /testFamily:testFamily/LATEST_TIMESTAMP/Maximum/vlen=0/seqid=0
> 2016-09-09 16:27:15,635 DEBUG 
> [B.fifo.QRpcServer.handler=5,queue=0,port=51833] ipc.CallRunner(110): 
> B.fifo.QRpcServer.handler=5,queue=0,port=51833: callId: 26 service: 
> ClientService methodName: Scan size: 26 connection: 192.168.42.75:51903
> java.io.IOException: Could not reseek StoreFileScanner[HFileScanner for 
> reader 
> reader=hdfs://localhost:51795/user/enis/test-data/d6fb1c70-93c1-4099-acb7-5723fc05a737/data/default/testScanThrowsException/b2adfb618e5d0fe225c1dc40c0eabfee/testFamily/5a213cc23b714e5e8e1a140ebbe72f2c,
>  compression=none, cacheConf=blockCache=LruBlockCache{blockCount=0, 
> currentSize=1567264, freeSize=1525578848, maxSize=1527146112, 
> heapSize=1567264, minSize=1450788736, minFactor=0.95, multiSize=725394368, 
> multiFactor=0.5, singleSize=362697184, singleFactor=0.25}, 
> cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, 
> cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, 
> prefetchOnOpen=false, firstKey=aaa/testFamily:testFamily/1473463633859/Put, 
> lastKey=zzz/testFamily:testFamily/1473463634271/Put, avgKeyLen=35, 
> avgValueLen=3, entries=17576, length=866998, 
> cur=/testFamily:/OLDEST_TIMESTAMP/Minimum/vlen=0/seqid=0] to key 
> /testFamily:testFamily/LATEST_TIMESTAMP/Maximum/vlen=0/seqid=0
>       at 
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:224)
>       at 
> org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.doRealSeek(NonLazyKeyValueScanner.java:55)
>       at 
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:312)
>       at 
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:268)
>       at 
> org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:815)
>       at 
> org.apache.hadoop.hbase.regionserver.StoreScanner.seekAsDirection(StoreScanner.java:803)
>       at 
> org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:625)
>       at 
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:147)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:5736)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:5887)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:5674)
>       at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2434)
>       at 
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32295)
>       at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2127)
>       at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
>       at 
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
>       at org.apache.hadoop.hbase.ipc.RpcExecutor$2.run(RpcExecutor.java:108)
>       at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: foo bar baz
>       at 
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:207)
>       ... 17 more
> 2016-09-09 16:27:15,653 INFO  [hconnection-0x71ad3d8a-shared--pool21-t10] 
> client.ScannerCallable(261): Got exception making request scanner_id: 1 
> number_of_rows: 100 close_scanner: false next_call_seq: 0 
> client_handles_partials: true client_handles_heartbeats: true renew: false to 
> region=testScanThrowsException,,1473463632707.b2adfb618e5d0fe225c1dc40c0eabfee.,
>  hostname=hw10676,51833,1473463626529, seqNum=2
> java.io.IOException: java.io.IOException: Could not reseek 
> StoreFileScanner[HFileScanner for reader 
> reader=hdfs://localhost:51795/user/enis/test-data/d6fb1c70-93c1-4099-acb7-5723fc05a737/data/default/testScanThrowsException/b2adfb618e5d0fe225c1dc40c0eabfee/testFamily/5a213cc23b714e5e8e1a140ebbe72f2c,
>  compression=none, cacheConf=blockCache=LruBlockCache{blockCount=0, 
> currentSize=1567264, freeSize=1525578848, maxSize=1527146112, 
> heapSize=1567264, minSize=1450788736, minFactor=0.95, multiSize=725394368, 
> multiFactor=0.5, singleSize=362697184, singleFactor=0.25}, 
> cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, 
> cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, 
> prefetchOnOpen=false, firstKey=aaa/testFamily:testFamily/1473463633859/Put, 
> lastKey=zzz/testFamily:testFamily/1473463634271/Put, avgKeyLen=35, 
> avgValueLen=3, entries=17576, length=866998, 
> cur=/testFamily:/OLDEST_TIMESTAMP/Minimum/vlen=0/seqid=0] to key 
> /testFamily:testFamily/LATEST_TIMESTAMP/Maximum/vlen=0/seqid=0
>       at 
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:224)
>       at 
> org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.doRealSeek(NonLazyKeyValueScanner.java:55)
>       at 
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:312)
>       at 
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:268)
>       at 
> org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:815)
>       at 
> org.apache.hadoop.hbase.regionserver.StoreScanner.seekAsDirection(StoreScanner.java:803)
>       at 
> org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:625)
>       at 
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:147)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:5736)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:5887)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:5674)
>       at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2434)
>       at 
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32295)
>       at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2127)
>       at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
>       at 
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
>       at org.apache.hadoop.hbase.ipc.RpcExecutor$2.run(RpcExecutor.java:108)
>       at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: foo bar baz
>       at 
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:207)
>       ... 17 more
>       at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>       at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>       at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>       at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>       at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>       at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
>       at 
> org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:332)
>       at 
> org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:256)
>       at 
> org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:1)
>       at 
> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
>       at 
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:364)
>       at 
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:1)
>       at 
> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:126)
>       at 
> org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:65)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:745)
> Caused by: 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(java.io.IOException): 
> java.io.IOException: Could not reseek StoreFileScanner[HFileScanner for 
> reader 
> reader=hdfs://localhost:51795/user/enis/test-data/d6fb1c70-93c1-4099-acb7-5723fc05a737/data/default/testScanThrowsException/b2adfb618e5d0fe225c1dc40c0eabfee/testFamily/5a213cc23b714e5e8e1a140ebbe72f2c,
>  compression=none, cacheConf=blockCache=LruBlockCache{blockCount=0, 
> currentSize=1567264, freeSize=1525578848, maxSize=1527146112, 
> heapSize=1567264, minSize=1450788736, minFactor=0.95, multiSize=725394368, 
> multiFactor=0.5, singleSize=362697184, singleFactor=0.25}, 
> cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, 
> cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, 
> prefetchOnOpen=false, firstKey=aaa/testFamily:testFamily/1473463633859/Put, 
> lastKey=zzz/testFamily:testFamily/1473463634271/Put, avgKeyLen=35, 
> avgValueLen=3, entries=17576, length=866998, 
> cur=/testFamily:/OLDEST_TIMESTAMP/Minimum/vlen=0/seqid=0] to key 
> /testFamily:testFamily/LATEST_TIMESTAMP/Maximum/vlen=0/seqid=0
>       at 
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:224)
>       at 
> org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.doRealSeek(NonLazyKeyValueScanner.java:55)
>       at 
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:312)
>       at 
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:268)
>       at 
> org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:815)
>       at 
> org.apache.hadoop.hbase.regionserver.StoreScanner.seekAsDirection(StoreScanner.java:803)
>       at 
> org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:625)
>       at 
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:147)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:5736)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:5887)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:5674)
>       at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2434)
>       at 
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32295)
>       at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2127)
>       at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
>       at 
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
>       at org.apache.hadoop.hbase.ipc.RpcExecutor$2.run(RpcExecutor.java:108)
>       at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: foo bar baz
>       at 
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:207)
>       ... 17 more
>       at 
> org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1225)
>       at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)
>       at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
>       at 
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:32741)
>       at 
> org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:214)
>       ... 9 more
> 2016-09-09 16:27:15,963 INFO  
> [B.priority.fifo.QRpcServer.handler=8,queue=0,port=51824] 
> regionserver.RSRpcServices(2196): scan request:region { type: REGION_NAME 
> value: "hbase:meta,,1" } scan { start_row: 
> "testScanThrowsException,,99999999999999" max_versions: 1 cache_blocks: true 
> small: true reversed: true caching: 1 } number_of_rows: 1 close_scanner: true 
> client_handles_partials: true client_handles_heartbeats: true
> 2016-09-09 16:27:15,965 INFO  
> [B.priority.fifo.QRpcServer.handler=8,queue=0,port=51824] 
> regionserver.RSRpcServices(2549): class 
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanResponse:cells_per_result:
>  4 scanner_id: 15 more_results: false stale: false partial_flag_per_result: 
> false more_results_in_region: true heartbeat_message: false
> 2016-09-09 16:27:15,967 INFO  
> [B.fifo.QRpcServer.handler=0,queue=0,port=51833] 
> regionserver.RSRpcServices(2196): scan request:scanner_id: 1 number_of_rows: 
> 100 close_scanner: false next_call_seq: 0 client_handles_partials: true 
> client_handles_heartbeats: true renew: false
> 2016-09-09 16:27:15,967 INFO  
> [B.fifo.QRpcServer.handler=0,queue=0,port=51833] 
> regionserver.RSRpcServices(2549): class 
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanResponse:scanner_id:
>  1 more_results: true ttl: 60000 stale: false more_results_in_region: false
> 2016-09-09 16:27:15,968 INFO  
> [B.fifo.QRpcServer.handler=1,queue=0,port=51833] 
> regionserver.RSRpcServices(2196): scan request:scanner_id: 1 number_of_rows: 
> 0 close_scanner: true client_handles_partials: true 
> client_handles_heartbeats: true
> 2016-09-09 16:27:15,968 INFO  
> [B.fifo.QRpcServer.handler=1,queue=0,port=51833] 
> regionserver.RSRpcServices(2549): class 
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanResponse:scanner_id:
>  1 more_results: false
> 2016-09-09 16:27:15,971 INFO  [main] hbase.HBaseTestingUtility(1111): 
> Shutting down minicluster
> {code}
> The ScannerCallable is retryable if it gets a IOException. Before the 
> HBASE-13662 fix, the retry was still failing due to 
> OutOfOrderScannerException, which then causes the Scanner stack to be reset, 
> thus masking the issue. After HBASE-13662, we can retry the same scan, but in 
> case of we are doing a {{reseek()}} which fails in middle with an 
> IOException, the KVHeaps and the Scanner state on the server side is left 
> dirty. 
> It seems to be much safer to reset the scanner state in case of IOExceptions. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to