[ 
https://issues.apache.org/jira/browse/HBASE-16604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15485409#comment-15485409
 ] 

Enis Soztutar commented on HBASE-16604:
---------------------------------------

Some more info.
KeyValueHeap.generalizedSeek() will leave the heap in "dirty" state by setting 
the {{current = null}} if it gets an IOException: 
{code }
  private boolean generalizedSeek(boolean isLazy, Cell seekKey,
      boolean forward, boolean useBloom) throws IOException {
    if (!isLazy && useBloom) {
      throw new IllegalArgumentException("Multi-column Bloom filter " +
          "optimization requires a lazy seek");
    }

    if (current == null) {
      return false;
    }
    heap.add(current);
    current = null;
    ...
    <throw exception> 
{code}

On the next call, this will return false, indicating that there are no more 
values to return. 

We can deal with this in a couple of different ways: 
(1) Handle IOExceptions in individual KVHeap methods and make sure that state 
is left consistent even in case of IOExceptions. 
(2) Handle IOExceptions in HRegionScannerImpl and reset the whole RegionScanner 
state before returning 
(3) Bubble the exception to the client, but make sure that the scanner is 
thrown away. The client will restart another RegionScanner, and possibly start 
from scanning from the start of the row throwing away partial results. 

I think, doing (1) will be very fragile. (2) also will not work, since there 
should be a way to reset the scanner state reliable in case of an IOException 
coming deep down from FS layer. If we are doing partial results, we maybe left 
in the middle of a row, but with partially seek'ed. Thus I think (2) also won't 
cut. 

(3) is the simplest, which would reset the scanner back to the start of the 
row, and makes sure that the ScannerCallable returns. The challenge with (3) is 
that, we want the ScannerCallable to not retry, but we want the ClientScanner 
to retry. ClientScanner only handles a couple of known exceptions which are 
derivatives of DNRIOE (UnknownScannerException, NotServingRegionException, 
OutOfOrderScannerNextException, etc). We can introduce another exception type 
(ResetScannerException), but we have to be careful for BC for existing clients. 


> Scanner retries on IOException can cause the scans to miss data 
> ----------------------------------------------------------------
>
>                 Key: HBASE-16604
>                 URL: https://issues.apache.org/jira/browse/HBASE-16604
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver, Scanners
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>             Fix For: 2.0.0, 1.3.0, 1.4.0, 1.1.7, 1.2.4
>
>
> Debugging an ITBLL failure, where the Verify did not "see" all the data in 
> the cluster, I've noticed that if we end up getting a generic IOException 
> from the HFileReader level, we may end up missing the rest of the data in the 
> region. I was able to manually test this, and this stack trace helps to 
> understand what is going on: 
> {code}
> 2016-09-09 16:27:15,633 INFO  [hconnection-0x71ad3d8a-shared--pool21-t9] 
> client.ScannerCallable(376): Open scanner=1 for 
> scan={"loadColumnFamiliesOnDemand":null,"startRow":"","stopRow":"","batch":-1,"cacheBlocks":true,"totalColumns":1,"maxResultSize":2097152,"families":{"testFamily":["testFamily"]},"caching":100,"maxVersions":1,"timeRange":[0,9223372036854775807]}
>  on region 
> region=testScanThrowsException,,1473463632707.b2adfb618e5d0fe225c1dc40c0eabfee.,
>  hostname=hw10676,51833,1473463626529, seqNum=2
> 2016-09-09 16:27:15,634 INFO  
> [B.fifo.QRpcServer.handler=5,queue=0,port=51833] 
> regionserver.RSRpcServices(2196): scan request:scanner_id: 1 number_of_rows: 
> 100 close_scanner: false next_call_seq: 0 client_handles_partials: true 
> client_handles_heartbeats: true renew: false
> 2016-09-09 16:27:15,635 INFO  
> [B.fifo.QRpcServer.handler=5,queue=0,port=51833] 
> regionserver.RSRpcServices(2510): Rolling back next call seqId
> 2016-09-09 16:27:15,635 INFO  
> [B.fifo.QRpcServer.handler=5,queue=0,port=51833] 
> regionserver.RSRpcServices(2565): Throwing new 
> ServiceExceptionjava.io.IOException: Could not reseek 
> StoreFileScanner[HFileScanner for reader 
> reader=hdfs://localhost:51795/user/enis/test-data/d6fb1c70-93c1-4099-acb7-5723fc05a737/data/default/testScanThrowsException/b2adfb618e5d0fe225c1dc40c0eabfee/testFamily/5a213cc23b714e5e8e1a140ebbe72f2c,
>  compression=none, cacheConf=blockCache=LruBlockCache{blockCount=0, 
> currentSize=1567264, freeSize=1525578848, maxSize=1527146112, 
> heapSize=1567264, minSize=1450788736, minFactor=0.95, multiSize=725394368, 
> multiFactor=0.5, singleSize=362697184, singleFactor=0.25}, 
> cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, 
> cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, 
> prefetchOnOpen=false, firstKey=aaa/testFamily:testFamily/1473463633859/Put, 
> lastKey=zzz/testFamily:testFamily/1473463634271/Put, avgKeyLen=35, 
> avgValueLen=3, entries=17576, length=866998, 
> cur=/testFamily:/OLDEST_TIMESTAMP/Minimum/vlen=0/seqid=0] to key 
> /testFamily:testFamily/LATEST_TIMESTAMP/Maximum/vlen=0/seqid=0
> 2016-09-09 16:27:15,635 DEBUG 
> [B.fifo.QRpcServer.handler=5,queue=0,port=51833] ipc.CallRunner(110): 
> B.fifo.QRpcServer.handler=5,queue=0,port=51833: callId: 26 service: 
> ClientService methodName: Scan size: 26 connection: 192.168.42.75:51903
> java.io.IOException: Could not reseek StoreFileScanner[HFileScanner for 
> reader 
> reader=hdfs://localhost:51795/user/enis/test-data/d6fb1c70-93c1-4099-acb7-5723fc05a737/data/default/testScanThrowsException/b2adfb618e5d0fe225c1dc40c0eabfee/testFamily/5a213cc23b714e5e8e1a140ebbe72f2c,
>  compression=none, cacheConf=blockCache=LruBlockCache{blockCount=0, 
> currentSize=1567264, freeSize=1525578848, maxSize=1527146112, 
> heapSize=1567264, minSize=1450788736, minFactor=0.95, multiSize=725394368, 
> multiFactor=0.5, singleSize=362697184, singleFactor=0.25}, 
> cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, 
> cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, 
> prefetchOnOpen=false, firstKey=aaa/testFamily:testFamily/1473463633859/Put, 
> lastKey=zzz/testFamily:testFamily/1473463634271/Put, avgKeyLen=35, 
> avgValueLen=3, entries=17576, length=866998, 
> cur=/testFamily:/OLDEST_TIMESTAMP/Minimum/vlen=0/seqid=0] to key 
> /testFamily:testFamily/LATEST_TIMESTAMP/Maximum/vlen=0/seqid=0
>       at 
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:224)
>       at 
> org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.doRealSeek(NonLazyKeyValueScanner.java:55)
>       at 
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:312)
>       at 
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:268)
>       at 
> org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:815)
>       at 
> org.apache.hadoop.hbase.regionserver.StoreScanner.seekAsDirection(StoreScanner.java:803)
>       at 
> org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:625)
>       at 
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:147)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:5736)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:5887)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:5674)
>       at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2434)
>       at 
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32295)
>       at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2127)
>       at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
>       at 
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
>       at org.apache.hadoop.hbase.ipc.RpcExecutor$2.run(RpcExecutor.java:108)
>       at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: foo bar baz
>       at 
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:207)
>       ... 17 more
> 2016-09-09 16:27:15,653 INFO  [hconnection-0x71ad3d8a-shared--pool21-t10] 
> client.ScannerCallable(261): Got exception making request scanner_id: 1 
> number_of_rows: 100 close_scanner: false next_call_seq: 0 
> client_handles_partials: true client_handles_heartbeats: true renew: false to 
> region=testScanThrowsException,,1473463632707.b2adfb618e5d0fe225c1dc40c0eabfee.,
>  hostname=hw10676,51833,1473463626529, seqNum=2
> java.io.IOException: java.io.IOException: Could not reseek 
> StoreFileScanner[HFileScanner for reader 
> reader=hdfs://localhost:51795/user/enis/test-data/d6fb1c70-93c1-4099-acb7-5723fc05a737/data/default/testScanThrowsException/b2adfb618e5d0fe225c1dc40c0eabfee/testFamily/5a213cc23b714e5e8e1a140ebbe72f2c,
>  compression=none, cacheConf=blockCache=LruBlockCache{blockCount=0, 
> currentSize=1567264, freeSize=1525578848, maxSize=1527146112, 
> heapSize=1567264, minSize=1450788736, minFactor=0.95, multiSize=725394368, 
> multiFactor=0.5, singleSize=362697184, singleFactor=0.25}, 
> cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, 
> cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, 
> prefetchOnOpen=false, firstKey=aaa/testFamily:testFamily/1473463633859/Put, 
> lastKey=zzz/testFamily:testFamily/1473463634271/Put, avgKeyLen=35, 
> avgValueLen=3, entries=17576, length=866998, 
> cur=/testFamily:/OLDEST_TIMESTAMP/Minimum/vlen=0/seqid=0] to key 
> /testFamily:testFamily/LATEST_TIMESTAMP/Maximum/vlen=0/seqid=0
>       at 
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:224)
>       at 
> org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.doRealSeek(NonLazyKeyValueScanner.java:55)
>       at 
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:312)
>       at 
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:268)
>       at 
> org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:815)
>       at 
> org.apache.hadoop.hbase.regionserver.StoreScanner.seekAsDirection(StoreScanner.java:803)
>       at 
> org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:625)
>       at 
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:147)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:5736)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:5887)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:5674)
>       at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2434)
>       at 
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32295)
>       at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2127)
>       at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
>       at 
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
>       at org.apache.hadoop.hbase.ipc.RpcExecutor$2.run(RpcExecutor.java:108)
>       at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: foo bar baz
>       at 
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:207)
>       ... 17 more
>       at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>       at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>       at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>       at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>       at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>       at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
>       at 
> org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:332)
>       at 
> org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:256)
>       at 
> org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:1)
>       at 
> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
>       at 
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:364)
>       at 
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:1)
>       at 
> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:126)
>       at 
> org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:65)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:745)
> Caused by: 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(java.io.IOException): 
> java.io.IOException: Could not reseek StoreFileScanner[HFileScanner for 
> reader 
> reader=hdfs://localhost:51795/user/enis/test-data/d6fb1c70-93c1-4099-acb7-5723fc05a737/data/default/testScanThrowsException/b2adfb618e5d0fe225c1dc40c0eabfee/testFamily/5a213cc23b714e5e8e1a140ebbe72f2c,
>  compression=none, cacheConf=blockCache=LruBlockCache{blockCount=0, 
> currentSize=1567264, freeSize=1525578848, maxSize=1527146112, 
> heapSize=1567264, minSize=1450788736, minFactor=0.95, multiSize=725394368, 
> multiFactor=0.5, singleSize=362697184, singleFactor=0.25}, 
> cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, 
> cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, 
> prefetchOnOpen=false, firstKey=aaa/testFamily:testFamily/1473463633859/Put, 
> lastKey=zzz/testFamily:testFamily/1473463634271/Put, avgKeyLen=35, 
> avgValueLen=3, entries=17576, length=866998, 
> cur=/testFamily:/OLDEST_TIMESTAMP/Minimum/vlen=0/seqid=0] to key 
> /testFamily:testFamily/LATEST_TIMESTAMP/Maximum/vlen=0/seqid=0
>       at 
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:224)
>       at 
> org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.doRealSeek(NonLazyKeyValueScanner.java:55)
>       at 
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:312)
>       at 
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:268)
>       at 
> org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:815)
>       at 
> org.apache.hadoop.hbase.regionserver.StoreScanner.seekAsDirection(StoreScanner.java:803)
>       at 
> org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:625)
>       at 
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:147)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:5736)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:5887)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:5674)
>       at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2434)
>       at 
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32295)
>       at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2127)
>       at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
>       at 
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
>       at org.apache.hadoop.hbase.ipc.RpcExecutor$2.run(RpcExecutor.java:108)
>       at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: foo bar baz
>       at 
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:207)
>       ... 17 more
>       at 
> org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1225)
>       at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)
>       at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
>       at 
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:32741)
>       at 
> org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:214)
>       ... 9 more
> 2016-09-09 16:27:15,963 INFO  
> [B.priority.fifo.QRpcServer.handler=8,queue=0,port=51824] 
> regionserver.RSRpcServices(2196): scan request:region { type: REGION_NAME 
> value: "hbase:meta,,1" } scan { start_row: 
> "testScanThrowsException,,99999999999999" max_versions: 1 cache_blocks: true 
> small: true reversed: true caching: 1 } number_of_rows: 1 close_scanner: true 
> client_handles_partials: true client_handles_heartbeats: true
> 2016-09-09 16:27:15,965 INFO  
> [B.priority.fifo.QRpcServer.handler=8,queue=0,port=51824] 
> regionserver.RSRpcServices(2549): class 
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanResponse:cells_per_result:
>  4 scanner_id: 15 more_results: false stale: false partial_flag_per_result: 
> false more_results_in_region: true heartbeat_message: false
> 2016-09-09 16:27:15,967 INFO  
> [B.fifo.QRpcServer.handler=0,queue=0,port=51833] 
> regionserver.RSRpcServices(2196): scan request:scanner_id: 1 number_of_rows: 
> 100 close_scanner: false next_call_seq: 0 client_handles_partials: true 
> client_handles_heartbeats: true renew: false
> 2016-09-09 16:27:15,967 INFO  
> [B.fifo.QRpcServer.handler=0,queue=0,port=51833] 
> regionserver.RSRpcServices(2549): class 
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanResponse:scanner_id:
>  1 more_results: true ttl: 60000 stale: false more_results_in_region: false
> 2016-09-09 16:27:15,968 INFO  
> [B.fifo.QRpcServer.handler=1,queue=0,port=51833] 
> regionserver.RSRpcServices(2196): scan request:scanner_id: 1 number_of_rows: 
> 0 close_scanner: true client_handles_partials: true 
> client_handles_heartbeats: true
> 2016-09-09 16:27:15,968 INFO  
> [B.fifo.QRpcServer.handler=1,queue=0,port=51833] 
> regionserver.RSRpcServices(2549): class 
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanResponse:scanner_id:
>  1 more_results: false
> 2016-09-09 16:27:15,971 INFO  [main] hbase.HBaseTestingUtility(1111): 
> Shutting down minicluster
> {code}
> The ScannerCallable is retryable if it gets a IOException. Before the 
> HBASE-13662 fix, the retry was still failing due to 
> OutOfOrderScannerException, which then causes the Scanner stack to be reset, 
> thus masking the issue. After HBASE-13662, we can retry the same scan, but in 
> case of we are doing a {{reseek()}} which fails in middle with an 
> IOException, the KVHeaps and the Scanner state on the server side is left 
> dirty. 
> It seems to be much safer to reset the scanner state in case of IOExceptions. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to