[
https://issues.apache.org/jira/browse/HBASE-11295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14275634#comment-14275634
]
James Estes commented on HBASE-11295:
-------------------------------------
I agree with chunhui. Retrying will likely just result in the same
exception, which doesn't tell the caller much about what actually happened
or what they could do to fix it (e.g. "try increasing the rpc
timeout"). It seems to me that the underlying issue is that a scan rpc times
out b/c it's doing a lot of filtering, then retries and immediately gets an
OutOfOrderScannerNextException (OOSNE), which there seem to be good reasons
for throwing here. But it is thrown without the original timeout exception,
so the caller gets no hint that the problem may have started with that timeout.
I have a similar issue, but much more troublesome in that it winds up in an
endless cycle: HBASE-12266. In that case as well, increasing the rpc timeout
was the only fix for the issue, but it took quite a bit of digging to figure
that out. I haven't been able to test the v2 patch there (the timeout change
was sufficient for me), but I like what it does: it makes the timeout check on
a DoNotRetryIOException unconditional (i.e. not just for
UnknownScannerExceptions). Even in my use case, the exception would have still
occurred, but it would have given me more meaningful information about what I
could do. At the very least, having this timeout check could be useful to
provide more information in the exception, so users have an idea of what to do
to proceed (without digging through the code :) ).
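To make the idea concrete, here is a minimal sketch of an unconditional timeout check. It is not the v2 patch itself and does not use the real HBase classes: DoNotRetryIOException is redefined as a stand-in so the example compiles on its own, and ScannerTimeoutSketchException, TimeoutCheckSketch, explain() and the message text are all hypothetical names chosen for illustration.

```java
import java.io.IOException;

// Stand-in for the HBase exception type, so the sketch compiles without HBase.
class DoNotRetryIOException extends IOException {
    DoNotRetryIOException(String msg) { super(msg); }
}

// Hypothetical wrapper that carries a hint plus the original failure as its cause.
class ScannerTimeoutSketchException extends IOException {
    ScannerTimeoutSketchException(String msg, Throwable cause) { super(msg, cause); }
}

class TimeoutCheckSketch {
    /**
     * Called when a scanner call fails with any DoNotRetryIOException.
     * If more time has passed since the last successful call than the
     * configured timeout, wrap the failure with a hint for the caller;
     * otherwise return the original exception unchanged.
     */
    static IOException explain(DoNotRetryIOException e, long lastNextMs,
                               long nowMs, long timeoutMs) {
        long elapsed = nowMs - lastNextMs;
        if (elapsed > timeoutMs) {
            return new ScannerTimeoutSketchException(
                elapsed + "ms passed since the last invocation, timeout is set to "
                    + timeoutMs + "ms; consider increasing the rpc timeout", e);
        }
        return e;
    }

    public static void main(String[] args) {
        DoNotRetryIOException oosne =
            new DoNotRetryIOException("OutOfOrderScannerNextException");
        // 75s elapsed against a 60s timeout: the caller gets the hint and the cause.
        IOException out = explain(oosne, 0L, 75_000L, 60_000L);
        System.out.println(out.getMessage());
        System.out.println("cause: " + out.getCause().getMessage());
    }
}
```

The point of the design is that the original exception is kept as the cause, so nothing is lost, while the message tells the caller which setting to look at.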
> Long running scan produces OutOfOrderScannerNextException
> ---------------------------------------------------------
>
> Key: HBASE-11295
> URL: https://issues.apache.org/jira/browse/HBASE-11295
> Project: HBase
> Issue Type: Bug
> Components: regionserver
> Affects Versions: 0.96.0
> Reporter: Jeff Cunningham
> Assignee: Andrew Purtell
> Priority: Critical
> Fix For: 1.0.0, 2.0.0, 0.98.10, 1.1.0
>
> Attachments: OutOfOrderScannerNextException.tar.gz
>
>
> Attached Files:
> HRegionServer.java - instrumented from 0.96.1.1-cdh5.0.0
> HBaseLeaseTimeoutIT.java - reproducing JUnit 4 test
> WaitFilter.java - Scan filter (extends FilterBase) that overrides
> filterRowKey() to sleep during invocation
> SpliceFilter.proto - Protobuf definition for WaitFilter.java
> OutOfOrderScann_InstramentedServer.log - instrumented server log
> Steps.txt - this note
> Set up:
> In HBaseLeaseTimeoutIT, create a scan, set the given filter (which sleeps in
> its overridden filterRowKey() method) on the scan, and scan the table.
> This is done in test client_0x0_server_150000x10().
> Here's what I'm seeing (see also attached log):
> A new request comes into the server (ID 1940798815214593802 -
> RpcServer.handler=96) and a RegionScanner is created for it, cached by ID,
> and immediately looked up again; the cached RegionScannerHolder's nextCallSeq
> is incremented (now at 1).
> The RegionScan thread goes to sleep in WaitFilter#filterRowKey().
> A short (variable) period later, another request comes into the server (ID
> 8946109289649235722 - RpcServer.handler=98) and the same series of events
> happens to this request.
> At this point both RegionScanner threads are sleeping in
> WaitFilter.filterRowKey(). After another period, the client retries another
> scan request which thinks its next_call_seq is 0. However, HRegionServer's
> cached RegionScannerHolder thinks the matching RegionScanner's nextCallSeq
> should be 1.
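
The sequence mismatch described in the quoted report can be reproduced in miniature. The sketch below is only a model of the bookkeeping, assuming the server advances its counter when it accepts a call and before the slow filter finishes; ScannerHolderSketch, OutOfOrderSketchException and the message text are hypothetical stand-ins, not the HRegionServer code.

```java
// Hypothetical stand-in for OutOfOrderScannerNextException.
class OutOfOrderSketchException extends Exception {
    OutOfOrderSketchException(String msg) { super(msg); }
}

// Models the server-side holder that tracks the expected call sequence.
class ScannerHolderSketch {
    private long nextCallSeq = 0L;

    // Accept the call only if the client's sequence matches, then advance.
    synchronized void checkAndAdvance(long clientSeq) throws OutOfOrderSketchException {
        if (clientSeq != nextCallSeq) {
            throw new OutOfOrderSketchException(
                "expected nextCallSeq " + nextCallSeq + " but client sent " + clientSeq);
        }
        nextCallSeq++;
    }

    synchronized long getNextCallSeq() { return nextCallSeq; }
}

class NextCallSeqDemo {
    public static void main(String[] args) {
        ScannerHolderSketch holder = new ScannerHolderSketch();
        long clientSeq = 0L; // the client only advances this after a response arrives
        try {
            holder.checkAndAdvance(clientSeq); // accepted; server side is now at 1
            // The slow filter would run here. The client times out without a
            // response, so clientSeq is still 0 when it retries.
            holder.checkAndAdvance(clientSeq); // rejected: server expects 1
        } catch (OutOfOrderSketchException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model the retry can never succeed, because the client has no way to learn that the server already advanced, which matches the behaviour reported above.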