[jira] [Commented] (HBASE-11295) Long running scan produces OutOfOrderScannerNextException

James Estes (JIRA) Tue, 13 Jan 2015 10:15:12 -0800

    [ 
https://issues.apache.org/jira/browse/HBASE-11295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14275634#comment-14275634
 ]


James Estes commented on HBASE-11295:
-------------------------------------

I agree with chunhui. Retrying will likely just result in the same 
exception, which tells the caller little about what actually happened or 
what they could do to fix it (e.g. "try increasing the rpc timeout"). It 
seems to me that the underlying issue is that a scan rpc times out because 
it's doing a lot of filtering, then the client retries and immediately gets 
an OOSNE (OutOfOrderScannerNextException), which seems to have good reasons 
here for being the right thing to do. The client throws that, but without the 
original timeout exception to tell the caller that the problem may have 
started with that timeout. 
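
As a rough illustration of the point about lost context (this is not HBase 
code and not taken from any patch on this issue), the sketch below shows a 
retry loop that remembers an earlier timeout and attaches it to the exception 
the caller finally sees. The class TimeoutContextDemo, its nested 
OutOfOrderException, and the scanOnce() method are all hypothetical stand-ins 
invented for the example.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

/** Toy retry loop; the exception type and scan call are illustrative, not HBase code. */
class TimeoutContextDemo {

    /** Hypothetical stand-in for an out-of-order scanner exception. */
    static class OutOfOrderException extends IOException {
        OutOfOrderException(String message) {
            super(message);
        }
    }

    /** Hypothetical scan call: the first attempt times out, the retry is rejected. */
    static void scanOnce(int attempt) throws IOException {
        if (attempt == 0) {
            throw new SocketTimeoutException("scan rpc timed out");
        }
        throw new OutOfOrderException("sequence mismatch on retry");
    }

    /** Runs two attempts and returns the failure the caller would see. */
    static IOException scanWithContext() {
        SocketTimeoutException earlierTimeout = null;
        for (int attempt = 0; attempt < 2; attempt++) {
            try {
                scanOnce(attempt);
                return null; // not reached in this toy: scanOnce always throws
            } catch (SocketTimeoutException e) {
                earlierTimeout = e; // remember the timeout, then retry
            } catch (IOException e) {
                if (earlierTimeout == null) {
                    return e;
                }
                // Keep the final failure as the cause and the earlier timeout as
                // a suppressed exception, plus a hint about what to change.
                IOException wrapped = new IOException(
                    "scan failed after an earlier rpc timeout; "
                        + "consider increasing the rpc timeout", e);
                wrapped.addSuppressed(earlierTimeout);
                return wrapped;
            }
        }
        return earlierTimeout;
    }
}
```

With this shape, a stack trace printed by the caller would show both the 
final failure and the timeout that preceded it.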

I have a similar issue, but a much more troublesome one in that it winds up 
in an endless cycle: HBASE-12266. In that case as well, increasing the rpc 
timeout was the only fix for the issue, but it took quite a bit of digging to 
figure that out. I haven't been able to test the v2 patch there (the timeout 
change was sufficient for me), but I like what it does: it makes the timeout 
check on a DoNotRetryIOException unconditional (i.e. not just for 
UnknownScannerExceptions). Even in my use case, the exception would still 
have occurred, but it would have given me more meaningful information about 
what I could do. At the very least, this timeout check could provide more 
information in the exception, so users have an idea of what to do to proceed 
(without digging through the code :) ). 
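
To make the idea of an unconditional timeout check concrete, here is a 
minimal sketch under stated assumptions. It is not the v2 patch from 
HBASE-12266 and does not use HBase classes: DoNotRetryException, 
ScanTimedOutException, and translate() are hypothetical names. The assumption 
is that the client records when the last successful call happened and, on any 
do-not-retry failure, compares the elapsed time with the configured timeout.

```java
import java.io.IOException;

/** Toy version of a timeout check applied to every do-not-retry failure. */
class TimeoutCheckDemo {

    /** Hypothetical stand-in for a "do not retry" exception. */
    static class DoNotRetryException extends IOException {
        DoNotRetryException(String message) {
            super(message);
        }
    }

    /** Hypothetical exception that tells the caller a timeout was exceeded. */
    static class ScanTimedOutException extends IOException {
        ScanTimedOutException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * If more time has passed since the last call than the timeout allows,
     * wrap the failure in a timeout exception; otherwise return it unchanged.
     * The check does not depend on the concrete subtype of the failure.
     */
    static IOException translate(DoNotRetryException failure,
                                 long lastCallMillis,
                                 long nowMillis,
                                 long timeoutMillis) {
        long elapsed = nowMillis - lastCallMillis;
        if (elapsed > timeoutMillis) {
            return new ScanTimedOutException(
                elapsed + "ms since the last call exceeds the "
                    + timeoutMillis + "ms timeout", failure);
        }
        return failure;
    }
}
```

Passing the timestamps in as parameters keeps the check deterministic; a real 
client would presumably read the clock itself.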

> Long running scan produces OutOfOrderScannerNextException
> ---------------------------------------------------------
>
>                 Key: HBASE-11295
>                 URL: https://issues.apache.org/jira/browse/HBASE-11295
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.96.0
>            Reporter: Jeff Cunningham
>            Assignee: Andrew Purtell
>            Priority: Critical
>             Fix For: 1.0.0, 2.0.0, 0.98.10, 1.1.0
>
>         Attachments: OutOfOrderScannerNextException.tar.gz
>
>
> Attached Files:
> HRegionServer.java - instrumented from 0.96.1.1-cdh5.0.0
> HBaseLeaseTimeoutIT.java - reproducing JUnit 4 test
> WaitFilter.java - Scan filter (extends FilterBase) that overrides 
> filterRowKey() to sleep during invocation
> SpliceFilter.proto - Protobuf definition for WaitFilter.java
> OutOfOrderScann_InstramentedServer.log - instrumented server log
> Steps.txt - this note
> Set up:
> In HBaseLeaseTimeoutIT, create a scan, set the given filter on it (the 
> filter sleeps in its overridden filterRowKey() method), and scan the table.
> This is done in test client_0x0_server_150000x10().
> Here's what I'm seeing (see also attached log):
> A new request comes into the server (ID 1940798815214593802 - 
> RpcServer.handler=96) and a RegionScanner is created for it, cached by ID, 
> and immediately looked up again, and the cached RegionScannerHolder's 
> nextCallSeq is incremented (now at 1).
> The RegionScan thread goes to sleep in WaitFilter#filterRowKey().
> A short (variable) period later, another request comes into the server (ID 
> 8946109289649235722 - RpcServer.handler=98) and the same series of events 
> happens to this request.
> At this point both RegionScanner threads are sleeping in 
> WaitFilter.filterRowKey(). After another period, the client retries with 
> another scan request that thinks its next_call_seq is 0.  However, 
> HRegionServer's cached RegionScannerHolder thinks the matching 
> RegionScanner's nextCallSeq should be 1.
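
The sequence-number mismatch described in the quoted report can be modeled 
with a small toy. This is a simplified sketch, not HRegionServer code: it uses 
a single scanner ID, and the ScannerSeqDemo class, its nested Server and 
OutOfOrderException types, and the message text are all hypothetical. It 
assumes the server advances its counter when it accepts a call, while a client 
that times out never advances its own.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of per-scanner call sequencing; none of these types are HBase classes. */
class ScannerSeqDemo {

    /** Hypothetical stand-in for an out-of-order scanner exception. */
    static class OutOfOrderException extends RuntimeException {
        OutOfOrderException(String message) {
            super(message);
        }
    }

    /** Hypothetical server that keeps the next expected sequence number per scanner ID. */
    static class Server {
        private final Map<Long, Long> nextCallSeq = new HashMap<>();

        void openScanner(long scannerId) {
            nextCallSeq.put(scannerId, 0L);
        }

        /** Rejects a call whose sequence number is unexpected, else advances the counter. */
        void next(long scannerId, long clientCallSeq) {
            long expected = nextCallSeq.get(scannerId);
            if (clientCallSeq != expected) {
                throw new OutOfOrderException(
                    "expected " + expected + " but got " + clientCallSeq);
            }
            // The counter advances before the (possibly slow) filtering would run.
            nextCallSeq.put(scannerId, expected + 1);
        }
    }

    /** Replays the sequence: the first call is accepted, the client times out, then retries. */
    static String retryAfterClientTimeout() {
        Server server = new Server();
        long scannerId = 42L;   // arbitrary illustrative ID
        long clientCallSeq = 0; // the client's view of the sequence number
        server.openScanner(scannerId);

        // First call: the server accepts it and moves its counter to 1. Assume
        // the client gives up waiting, so it never advances clientCallSeq.
        server.next(scannerId, clientCallSeq);

        try {
            // The retry still carries sequence number 0.
            server.next(scannerId, clientCallSeq);
            return "accepted";
        } catch (OutOfOrderException e) {
            return e.getMessage();
        }
    }
}
```

Calling ScannerSeqDemo.retryAfterClientTimeout() returns the rejection 
message from the second call, since the server expects 1 and the client 
sends 0.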



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
