[
https://issues.apache.org/jira/browse/HBASE-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271863#comment-13271863
]
Todd Lipcon commented on HBASE-5973:
------------------------------------
Attached patch implements the suggested idea, and hooks it up for
scanner.next().
I spent 2.5 hours trying to write a test case for it, but we have so many
layers of byzantine caching going on above the IPC sockets that I couldn't
figure out how to make a client IPC connection actually hard-disconnect. So I
tested it from the shell. Here's the manual test plan:
1) Create a table with 100 or so rows.
2) Issue the following from the shell:
{code}
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.95-SNAPSHOT, r5c65cc4a19fbc00876a365b10e98142238dc9a97, Wed May 9 13:06:25 PDT 2012
hbase(main):001:0> import org.apache.hadoop.hbase.filter.TestFilter
=> Java::OrgApacheHadoopHbaseFilter::TestFilter
hbase(main):002:0> scan 't1', { FILTER => TestFilter::SlowScanFilter.new(), CACHE => 50 }
ROW COLUMN+CELL
{code}
(shell will hang here)
On the server side, you should see:
{code}
12/05/09 15:03:29 INFO filter.TestFilter: Handler thread Thread[IPC Server handler 0 on 58364,5,main] sleeping in filter...
12/05/09 15:03:30 INFO filter.TestFilter: Handler thread Thread[IPC Server handler 0 on 58364,5,main] sleeping in filter...
12/05/09 15:03:31 INFO filter.TestFilter: Handler thread Thread[IPC Server handler 0 on 58364,5,main] sleeping in filter...
12/05/09 15:03:32 INFO filter.TestFilter: Handler thread Thread[IPC Server handler 0 on 58364,5,main] sleeping in filter...
{code}
Now ^C the shell. You should see on the server:
{code}
12/05/09 15:03:33 ERROR regionserver.RegionServer:
org.apache.hadoop.hbase.ipc.CallerDisconnectedException: Aborting call
scan(null, scannerId: 4581116627867187291
numberOfRows: 50
closeScanner: false
), rpc version=1, client version=1, methodsFingerPrint=-944626147 from
127.0.0.1:55648 after 5009 ms, since caller disconnected
    at org.apache.hadoop.hbase.ipc.HBaseServer$Call.throwExceptionIfCallerDisconnected(HBaseServer.java:417)
    at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:3433)
    at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3391)
    at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3415)
    at org.apache.hadoop.hbase.regionserver.RegionServer.scan(RegionServer.java:828)
    at sun.reflect.GeneratedMethodAccessor26.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:358)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1387)
12/05/09 15:03:33 WARN ipc.HBaseServer: IPC Server Responder, call scan(null,
scannerId: 4581116627867187291
numberOfRows: 50
closeScanner: false
), rpc version=1, client version=1, methodsFingerPrint=-944626147 from
127.0.0.1:55648: output error
12/05/09 15:03:33 WARN ipc.HBaseServer: IPC Server handler 0 on 58364 caught a
ClosedChannelException, this means that the server was processing a request but
the client went away. The error message was: null
{code}
We could probably improve the messaging slightly, but this is at least an
improvement in that the thread doesn't continue to get hung up indefinitely.
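For anyone who wants the shape of the idea without reading the patch, here is a rough, self-contained sketch. It is not the patch itself: only the names CallerDisconnectedException and throwExceptionIfCallerDisconnected come from the stack trace above. The Call class, the next() helper, and the AtomicBoolean standing in for the state of the client socket are all assumptions made for illustration.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.IntPredicate;
import java.util.stream.IntStream;

class DisconnectAwareScan {

    /** Stand-in for the exception seen in the server log (simplified here). */
    static class CallerDisconnectedException extends IOException {
        CallerDisconnectedException(String msg) {
            super(msg);
        }
    }

    /** Hypothetical server-side call; the flag stands in for the client channel state. */
    static class Call {
        private final AtomicBoolean connectionOpen = new AtomicBoolean(true);
        private final long startNanos = System.nanoTime();

        /** Simulates the client going away (for example, ^C in the shell). */
        void disconnect() {
            connectionOpen.set(false);
        }

        /** The hook: long-running handlers call this periodically. */
        void throwExceptionIfCallerDisconnected() throws CallerDisconnectedException {
            if (!connectionOpen.get()) {
                long elapsedMs = (System.nanoTime() - startNanos) / 1_000_000;
                throw new CallerDisconnectedException(
                    "Aborting call after " + elapsedMs + " ms, since caller disconnected");
            }
        }
    }

    /**
     * Hypothetical scanner loop: collects up to numberOfRows rows that pass the
     * filter, checking before each row whether the caller is still connected.
     */
    static List<Integer> next(Call call, Iterator<Integer> rows, IntPredicate filter,
                              int numberOfRows) throws CallerDisconnectedException {
        List<Integer> results = new ArrayList<>();
        while (results.size() < numberOfRows && rows.hasNext()) {
            call.throwExceptionIfCallerDisconnected();
            int row = rows.next();
            if (filter.test(row)) {
                results.add(row);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        Call call = new Call();
        // A very restrictive filter that rejects every row; the client "hangs up"
        // while row 10 is being evaluated.
        IntPredicate restrictive = row -> {
            if (row == 10) {
                call.disconnect();
            }
            return false;
        };
        try {
            next(call, IntStream.range(0, 100).iterator(), restrictive, 50);
            System.out.println("scan ran to completion");
        } catch (CallerDisconnectedException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The check sits inside the per-row loop because a restrictive filter can keep the handler busy for a long time without ever filling the requested batch, which is the situation described in the issue.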
> Add ability for potentially long-running IPC calls to abort if client
> disconnects
> ---------------------------------------------------------------------------------
>
> Key: HBASE-5973
> URL: https://issues.apache.org/jira/browse/HBASE-5973
> Project: HBase
> Issue Type: Improvement
> Components: ipc
> Affects Versions: 0.90.7, 0.92.1, 0.94.0, 0.96.0
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Attachments: hbase-5973.txt
>
>
> We recently had a cluster issue where a user was submitting scanners with a
> very restrictive filter, and then calling next() with a high scanner caching
> value. The clients would generally time out the next() call and disconnect,
> but the IPC kept running looking to fill the requested number of rows. Since
> this was in the context of MR, the tasks making the calls would retry, and
> the retries would be more likely to time out due to contention with the
> previous still-running scanner next() call. Eventually, the system spiraled
> out of control.
> We should add a hook to the IPC system so that RPC calls can check if the
> client has already disconnected. In such a case, the next() call could abort
> processing, given any further work is wasted. I imagine coprocessor
> endpoints, etc, could make good use of this as well.