[
https://issues.apache.org/jira/browse/HBASE-16604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Busbey updated HBASE-16604:
--------------------------------
Priority: Critical (was: Major)
> Scanner retries on IOException can cause the scans to miss data
> ----------------------------------------------------------------
>
> Key: HBASE-16604
> URL: https://issues.apache.org/jira/browse/HBASE-16604
> Project: HBase
> Issue Type: Bug
> Components: regionserver, Scanners
> Reporter: Enis Soztutar
> Assignee: Enis Soztutar
> Priority: Critical
> Fix For: 2.0.0, 1.3.0, 1.4.0, 1.1.7, 1.2.4
>
> Attachments: HBASE-16604-branch-1.3-addendum.patch,
> hbase-16604_v1.patch, hbase-16604_v2.patch, hbase-16604_v3.branch-1.patch,
> hbase-16604_v3.patch
>
>
> Debugging an ITBLL failure, where the Verify did not "see" all the data in
> the cluster, I've noticed that if we end up getting a generic IOException
> from the HFileReader level, we may end up missing the rest of the data in the
> region. I was able to manually test this, and this stack trace helps to
> understand what is going on:
> {code}
> 2016-09-09 16:27:15,633 INFO [hconnection-0x71ad3d8a-shared--pool21-t9]
> client.ScannerCallable(376): Open scanner=1 for
> scan={"loadColumnFamiliesOnDemand":null,"startRow":"","stopRow":"","batch":-1,"cacheBlocks":true,"totalColumns":1,"maxResultSize":2097152,"families":{"testFamily":["testFamily"]},"caching":100,"maxVersions":1,"timeRange":[0,9223372036854775807]}
> on region
> region=testScanThrowsException,,1473463632707.b2adfb618e5d0fe225c1dc40c0eabfee.,
> hostname=hw10676,51833,1473463626529, seqNum=2
> 2016-09-09 16:27:15,634 INFO
> [B.fifo.QRpcServer.handler=5,queue=0,port=51833]
> regionserver.RSRpcServices(2196): scan request:scanner_id: 1 number_of_rows:
> 100 close_scanner: false next_call_seq: 0 client_handles_partials: true
> client_handles_heartbeats: true renew: false
> 2016-09-09 16:27:15,635 INFO
> [B.fifo.QRpcServer.handler=5,queue=0,port=51833]
> regionserver.RSRpcServices(2510): Rolling back next call seqId
> 2016-09-09 16:27:15,635 INFO
> [B.fifo.QRpcServer.handler=5,queue=0,port=51833]
> regionserver.RSRpcServices(2565): Throwing new
> ServiceExceptionjava.io.IOException: Could not reseek
> StoreFileScanner[HFileScanner for reader
> reader=hdfs://localhost:51795/user/enis/test-data/d6fb1c70-93c1-4099-acb7-5723fc05a737/data/default/testScanThrowsException/b2adfb618e5d0fe225c1dc40c0eabfee/testFamily/5a213cc23b714e5e8e1a140ebbe72f2c,
> compression=none, cacheConf=blockCache=LruBlockCache{blockCount=0,
> currentSize=1567264, freeSize=1525578848, maxSize=1527146112,
> heapSize=1567264, minSize=1450788736, minFactor=0.95, multiSize=725394368,
> multiFactor=0.5, singleSize=362697184, singleFactor=0.25},
> cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false,
> cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false,
> prefetchOnOpen=false, firstKey=aaa/testFamily:testFamily/1473463633859/Put,
> lastKey=zzz/testFamily:testFamily/1473463634271/Put, avgKeyLen=35,
> avgValueLen=3, entries=17576, length=866998,
> cur=/testFamily:/OLDEST_TIMESTAMP/Minimum/vlen=0/seqid=0] to key
> /testFamily:testFamily/LATEST_TIMESTAMP/Maximum/vlen=0/seqid=0
> 2016-09-09 16:27:15,635 DEBUG
> [B.fifo.QRpcServer.handler=5,queue=0,port=51833] ipc.CallRunner(110):
> B.fifo.QRpcServer.handler=5,queue=0,port=51833: callId: 26 service:
> ClientService methodName: Scan size: 26 connection: 192.168.42.75:51903
> java.io.IOException: Could not reseek StoreFileScanner[HFileScanner for
> reader
> reader=hdfs://localhost:51795/user/enis/test-data/d6fb1c70-93c1-4099-acb7-5723fc05a737/data/default/testScanThrowsException/b2adfb618e5d0fe225c1dc40c0eabfee/testFamily/5a213cc23b714e5e8e1a140ebbe72f2c,
> compression=none, cacheConf=blockCache=LruBlockCache{blockCount=0,
> currentSize=1567264, freeSize=1525578848, maxSize=1527146112,
> heapSize=1567264, minSize=1450788736, minFactor=0.95, multiSize=725394368,
> multiFactor=0.5, singleSize=362697184, singleFactor=0.25},
> cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false,
> cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false,
> prefetchOnOpen=false, firstKey=aaa/testFamily:testFamily/1473463633859/Put,
> lastKey=zzz/testFamily:testFamily/1473463634271/Put, avgKeyLen=35,
> avgValueLen=3, entries=17576, length=866998,
> cur=/testFamily:/OLDEST_TIMESTAMP/Minimum/vlen=0/seqid=0] to key
> /testFamily:testFamily/LATEST_TIMESTAMP/Maximum/vlen=0/seqid=0
> at
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:224)
> at
> org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.doRealSeek(NonLazyKeyValueScanner.java:55)
> at
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:312)
> at
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:268)
> at
> org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:815)
> at
> org.apache.hadoop.hbase.regionserver.StoreScanner.seekAsDirection(StoreScanner.java:803)
> at
> org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:625)
> at
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:147)
> at
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:5736)
> at
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:5887)
> at
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:5674)
> at
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2434)
> at
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32295)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2127)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
> at
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$2.run(RpcExecutor.java:108)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: foo bar baz
> at
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:207)
> ... 17 more
> 2016-09-09 16:27:15,653 INFO [hconnection-0x71ad3d8a-shared--pool21-t10]
> client.ScannerCallable(261): Got exception making request scanner_id: 1
> number_of_rows: 100 close_scanner: false next_call_seq: 0
> client_handles_partials: true client_handles_heartbeats: true renew: false to
> region=testScanThrowsException,,1473463632707.b2adfb618e5d0fe225c1dc40c0eabfee.,
> hostname=hw10676,51833,1473463626529, seqNum=2
> java.io.IOException: java.io.IOException: Could not reseek
> StoreFileScanner[HFileScanner for reader
> reader=hdfs://localhost:51795/user/enis/test-data/d6fb1c70-93c1-4099-acb7-5723fc05a737/data/default/testScanThrowsException/b2adfb618e5d0fe225c1dc40c0eabfee/testFamily/5a213cc23b714e5e8e1a140ebbe72f2c,
> compression=none, cacheConf=blockCache=LruBlockCache{blockCount=0,
> currentSize=1567264, freeSize=1525578848, maxSize=1527146112,
> heapSize=1567264, minSize=1450788736, minFactor=0.95, multiSize=725394368,
> multiFactor=0.5, singleSize=362697184, singleFactor=0.25},
> cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false,
> cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false,
> prefetchOnOpen=false, firstKey=aaa/testFamily:testFamily/1473463633859/Put,
> lastKey=zzz/testFamily:testFamily/1473463634271/Put, avgKeyLen=35,
> avgValueLen=3, entries=17576, length=866998,
> cur=/testFamily:/OLDEST_TIMESTAMP/Minimum/vlen=0/seqid=0] to key
> /testFamily:testFamily/LATEST_TIMESTAMP/Maximum/vlen=0/seqid=0
> at
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:224)
> at
> org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.doRealSeek(NonLazyKeyValueScanner.java:55)
> at
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:312)
> at
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:268)
> at
> org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:815)
> at
> org.apache.hadoop.hbase.regionserver.StoreScanner.seekAsDirection(StoreScanner.java:803)
> at
> org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:625)
> at
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:147)
> at
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:5736)
> at
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:5887)
> at
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:5674)
> at
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2434)
> at
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32295)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2127)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
> at
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$2.run(RpcExecutor.java:108)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: foo bar baz
> at
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:207)
> ... 17 more
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
> at
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
> at
> org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:332)
> at
> org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:256)
> at
> org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:1)
> at
> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
> at
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:364)
> at
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:1)
> at
> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:126)
> at
> org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:65)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by:
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(java.io.IOException):
> java.io.IOException: Could not reseek StoreFileScanner[HFileScanner for
> reader
> reader=hdfs://localhost:51795/user/enis/test-data/d6fb1c70-93c1-4099-acb7-5723fc05a737/data/default/testScanThrowsException/b2adfb618e5d0fe225c1dc40c0eabfee/testFamily/5a213cc23b714e5e8e1a140ebbe72f2c,
> compression=none, cacheConf=blockCache=LruBlockCache{blockCount=0,
> currentSize=1567264, freeSize=1525578848, maxSize=1527146112,
> heapSize=1567264, minSize=1450788736, minFactor=0.95, multiSize=725394368,
> multiFactor=0.5, singleSize=362697184, singleFactor=0.25},
> cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false,
> cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false,
> prefetchOnOpen=false, firstKey=aaa/testFamily:testFamily/1473463633859/Put,
> lastKey=zzz/testFamily:testFamily/1473463634271/Put, avgKeyLen=35,
> avgValueLen=3, entries=17576, length=866998,
> cur=/testFamily:/OLDEST_TIMESTAMP/Minimum/vlen=0/seqid=0] to key
> /testFamily:testFamily/LATEST_TIMESTAMP/Maximum/vlen=0/seqid=0
> at
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:224)
> at
> org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.doRealSeek(NonLazyKeyValueScanner.java:55)
> at
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:312)
> at
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:268)
> at
> org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:815)
> at
> org.apache.hadoop.hbase.regionserver.StoreScanner.seekAsDirection(StoreScanner.java:803)
> at
> org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:625)
> at
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:147)
> at
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:5736)
> at
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:5887)
> at
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:5674)
> at
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2434)
> at
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32295)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2127)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
> at
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$2.run(RpcExecutor.java:108)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: foo bar baz
> at
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:207)
> ... 17 more
> at
> org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1225)
> at
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)
> at
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
> at
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:32741)
> at
> org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:214)
> ... 9 more
> 2016-09-09 16:27:15,963 INFO
> [B.priority.fifo.QRpcServer.handler=8,queue=0,port=51824]
> regionserver.RSRpcServices(2196): scan request:region { type: REGION_NAME
> value: "hbase:meta,,1" } scan { start_row:
> "testScanThrowsException,,99999999999999" max_versions: 1 cache_blocks: true
> small: true reversed: true caching: 1 } number_of_rows: 1 close_scanner: true
> client_handles_partials: true client_handles_heartbeats: true
> 2016-09-09 16:27:15,965 INFO
> [B.priority.fifo.QRpcServer.handler=8,queue=0,port=51824]
> regionserver.RSRpcServices(2549): class
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanResponse:cells_per_result:
> 4 scanner_id: 15 more_results: false stale: false partial_flag_per_result:
> false more_results_in_region: true heartbeat_message: false
> 2016-09-09 16:27:15,967 INFO
> [B.fifo.QRpcServer.handler=0,queue=0,port=51833]
> regionserver.RSRpcServices(2196): scan request:scanner_id: 1 number_of_rows:
> 100 close_scanner: false next_call_seq: 0 client_handles_partials: true
> client_handles_heartbeats: true renew: false
> 2016-09-09 16:27:15,967 INFO
> [B.fifo.QRpcServer.handler=0,queue=0,port=51833]
> regionserver.RSRpcServices(2549): class
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanResponse:scanner_id:
> 1 more_results: true ttl: 60000 stale: false more_results_in_region: false
> 2016-09-09 16:27:15,968 INFO
> [B.fifo.QRpcServer.handler=1,queue=0,port=51833]
> regionserver.RSRpcServices(2196): scan request:scanner_id: 1 number_of_rows:
> 0 close_scanner: true client_handles_partials: true
> client_handles_heartbeats: true
> 2016-09-09 16:27:15,968 INFO
> [B.fifo.QRpcServer.handler=1,queue=0,port=51833]
> regionserver.RSRpcServices(2549): class
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanResponse:scanner_id:
> 1 more_results: false
> 2016-09-09 16:27:15,971 INFO [main] hbase.HBaseTestingUtility(1111):
> Shutting down minicluster
> {code}
> The ScannerCallable is retryable if it gets a IOException. Before the
> HBASE-13662 fix, the retry was still failing due to
> OutOfOrderScannerException, which then causes the Scanner stack to be reset,
> thus masking the issue. After HBASE-13662, we can retry the same scan, but in
> case of we are doing a {{reseek()}} which fails in middle with an
> IOException, the KVHeaps and the Scanner state on the server side is left
> dirty.
> It seems to be much safer to reset the scanner state in case of IOExceptions.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)