[ 
https://issues.apache.org/jira/browse/HBASE-11295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell reopened HBASE-11295:
------------------------------------

We can get an OutOfOrderScannerNextException if the server thinks it has 
processed a scanner ‘next’ call but the client does not. The client retries 
that ‘next’ RPC, and the retry can fail the same way even though technically 
it is using a new (relocated) scanner.

When the client gets an OutOfOrderScannerNextException, the ClientScanner will 
retry, but only once. We use the boolean control variable 
{{retryAfterOutOfOrderException}}, set to 'true' initially, then set to 'false' 
when looping back to relocate and retry. 
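
A minimal sketch of that retry-once control flow, as described above. This is 
not the ClientScanner source: apart from retryAfterOutOfOrderException, every 
name here (RetryOnceSketch, NextCall, the stand-in exception class) is 
hypothetical, and the sketch models a single 'next' invocation only.

```java
// Simplified model of the retry-once behavior described above.
// Hypothetical names throughout, except retryAfterOutOfOrderException.
class RetryOnceSketch {
    /** Stand-in for OutOfOrderScannerNextException. */
    static class OutOfOrderException extends RuntimeException {
        OutOfOrderException(String msg) { super(msg); }
    }

    /** One attempt of a 'next' call: returns a row count or throws. */
    interface NextCall {
        int call(int attempt);
    }

    static int nextWithSingleRetry(NextCall next) {
        boolean retryAfterOutOfOrderException = true; // 'true' initially
        int attempt = 0;
        while (true) {
            try {
                return next.call(attempt++);
            } catch (OutOfOrderException e) {
                if (retryAfterOutOfOrderException) {
                    // First failure: flip the flag and loop back. The real
                    // client would relocate and reopen the scanner here.
                    retryAfterOutOfOrderException = false;
                    continue;
                }
                // Second failure: no retries left, surface the exception.
                throw e;
            }
        }
    }

    public static void main(String[] args) {
        // Fails on the first attempt only, so the single retry succeeds.
        int rows = nextWithSingleRetry(a -> {
            if (a == 0) throw new OutOfOrderException("attempt 0 failed");
            return 5;
        });
        System.out.println("rows after one retry: " + rows);
    }
}
```

A call that fails on both the first attempt and the retry is rethrown to the 
caller, which is the case discussed further down.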

A comment in ScannerCallable#next says: "_If at the server side fetching of 
next batch of data was over, there will be mismatch in the nextCallSeq number. 
Server will throw OutOfOrderScannerNextException and then client will reopen 
the scanner with start row as the last successfully retrieved row._" This is 
what happens. We set ‘callable’ to null before looping back around, so 
nextScanner() will create a new ScannerCallable. The new ScannerCallable does 
not have an initialized ‘scannerId’ so it builds a scan open request and sends 
it to the server. On the server side, this creates a new RegionScanner with a 
new identifier. This is like starting the scan over, except the start row has 
been updated to the last position of the previous scanner, so from the 
application's perspective the result stream is seamless. Both the new 
RegionScanner and the 
ScannerCallable on the client restart with nextCallSeq values of 0. 
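
The sequence-number handshake above can be modeled with a toy client and 
server. This is only a sketch under stated assumptions: ToyServer, ToyClient 
and the exception class are hypothetical stand-ins, the server is assumed to 
advance its counter as soon as it handles a 'next', and a lost response is 
modeled by the client simply not advancing its own counter.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the nextCallSeq handshake. All class and method names are
// hypothetical; only the nextCallSeq idea comes from the discussion above.
class NextCallSeqSketch {
    /** Stand-in for OutOfOrderScannerNextException. */
    static class OutOfOrderException extends RuntimeException {
        OutOfOrderException(String msg) { super(msg); }
    }

    /** Server side: one expected sequence number per open scanner id. */
    static class ToyServer {
        private final Map<Long, Long> expectedSeq = new HashMap<>();
        private long nextScannerId = 1;

        /** Opens a new scanner whose expected sequence starts at 0. */
        long open() {
            long id = nextScannerId++;
            expectedSeq.put(id, 0L);
            return id;
        }

        /** Handles a 'next': rejects a stale sequence, else advances it. */
        void next(long scannerId, long clientSeq) {
            long expected = expectedSeq.get(scannerId);
            if (clientSeq != expected) {
                throw new OutOfOrderException(
                    "expecting nextCallSeq " + expected + ", got " + clientSeq);
            }
            expectedSeq.put(scannerId, expected + 1);
        }
    }

    /** Client side: tracks the scanner id and its own sequence number. */
    static class ToyClient {
        final ToyServer server;
        long scannerId;
        long nextCallSeq;

        ToyClient(ToyServer server) {
            this.server = server;
            reopen();
        }

        /** Models relocation: new scanner id, both sides restart at 0. */
        void reopen() {
            scannerId = server.open();
            nextCallSeq = 0;
        }

        /** responseDelivered=false models a timeout after the server ran. */
        void next(boolean responseDelivered) {
            server.next(scannerId, nextCallSeq);
            if (responseDelivered) {
                nextCallSeq++;
            }
        }
    }

    public static void main(String[] args) {
        ToyClient client = new ToyClient(new ToyServer());
        client.next(false);          // server advanced, client did not
        try {
            client.next(true);       // retry with the stale sequence number
        } catch (OutOfOrderException e) {
            System.out.println("server rejected retry: " + e.getMessage());
        }
        client.reopen();             // fresh scanner, both counters at 0
        client.next(true);           // succeeds again
        System.out.println("client nextCallSeq: " + client.nextCallSeq);
    }
}
```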

Now with the new scanner we run into bad luck. The retried request on the new 
scanner times out like the first one, again with the server thinking the 
client should have advanced. However, inside the ClientScanner state the value 
of retryAfterOutOfOrderException is ‘false’, so this time we let the 
OutOfOrderScannerNextException bubble up to the application: "expecting 
nextCallSeq 1, got 0".

I could be missing something. If not, this doesn’t seem quite right. We are 
using a new scanner after relocation, like we do for NSREs, that just happens 
to fail the same way as the last one, perhaps due to a socket timeout sending 
the response under similar prevailing conditions. Why have the special case 
handling controlled by retryAfterOutOfOrderException? We retry NSREs up to a 
configured threshold, then give up. Use the same threshold for 
OutOfOrderScannerNextExceptions?
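
To make the suggestion concrete, here is a sketch of what a threshold-bounded 
retry could look like in place of the one-shot flag. This is an illustration 
of the proposal only, not an existing patch: the class, the NextCall 
interface, the stand-in exception and the maxRetries parameter are all 
hypothetical, and how the real threshold would be read from configuration is 
left out.

```java
// Sketch of the proposal: retry out-of-order failures up to a threshold
// instead of exactly once. All names here are hypothetical.
class BoundedRetrySketch {
    /** Stand-in for OutOfOrderScannerNextException. */
    static class OutOfOrderException extends RuntimeException {
        OutOfOrderException(String msg) { super(msg); }
    }

    /** One attempt of a 'next' call: returns a row count or throws. */
    interface NextCall {
        int call(int attempt);
    }

    static int nextWithBoundedRetries(NextCall next, int maxRetries) {
        int retries = 0;
        while (true) {
            try {
                return next.call(retries);
            } catch (OutOfOrderException e) {
                if (retries >= maxRetries) {
                    throw e; // threshold reached, give up
                }
                // The real client would relocate and reopen the scanner
                // here before trying again.
                retries++;
            }
        }
    }

    public static void main(String[] args) {
        // Two consecutive failures are tolerated with a threshold of 3.
        int rows = nextWithBoundedRetries(a -> {
            if (a < 2) throw new OutOfOrderException("attempt " + a + " failed");
            return 7;
        }, 3);
        System.out.println("rows after bounded retries: " + rows);
    }
}
```

With a threshold of 1 this behaves like the current single-retry logic; a 
larger threshold lets two back-to-back timeouts on relocated scanners 
succeed on a later attempt instead of failing the scan.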

Thoughts?

> Long running scan produces OutOfOrderScannerNextException
> ---------------------------------------------------------
>
>                 Key: HBASE-11295
>                 URL: https://issues.apache.org/jira/browse/HBASE-11295
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.96.0
>            Reporter: Jeff Cunningham
>         Attachments: OutOfOrderScannerNextException.tar.gz
>
>
> Attached Files:
> HRegionServer.java - instrumented from 0.96.1.1-cdh5.0.0
> HBaseLeaseTimeoutIT.java - reproducing JUnit 4 test
> WaitFilter.java - Scan filter (extends FilterBase) that overrides 
> filterRowKey() to sleep during invocation
> SpliceFilter.proto - Protobuf definition for WaitFilter.java
> OutOfOrderScann_InstramentedServer.log - instrumented server log
> Steps.txt - this note
> Set up:
> In HBaseLeaseTimeoutIT, create a scan, set the given filter (which sleeps in 
> its overridden filterRowKey() method) on the scan, and scan the table.
> This is done in test client_0x0_server_150000x10().
> Here's what I'm seeing (see also attached log):
> A new request comes into the server (ID 1940798815214593802 - 
> RpcServer.handler=96) and a RegionScanner is created for it. It is cached by 
> ID, then immediately looked up again, and the cached RegionScannerHolder's 
> nextCallSeq is incremented (now at 1).
> The RegionScan thread goes to sleep in WaitFilter#filterRowKey().
> A short (variable) period later, another request comes into the server (ID 
> 8946109289649235722 - RpcServer.handler=98) and the same series of events 
> happens to this request.
> At this point both RegionScanner threads are sleeping in 
> WaitFilter.filterRowKey(). After another period, the client retries with 
> another scan request that carries a next_call_seq of 0. However, 
> HRegionServer's cached RegionScannerHolder thinks the matching 
> RegionScanner's nextCallSeq should be 1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
