[ 
https://issues.apache.org/jira/browse/HBASE-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271863#comment-13271863
 ] 

Todd Lipcon commented on HBASE-5973:
------------------------------------

Attached patch implements the suggested idea, and hooks it up for 
scanner.next().
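
As a rough illustration of the idea (not the attached patch), the check can be sketched in plain Java. The names RpcCall, ScanSketch, markDisconnected and rowsExamined are hypothetical, and CallerDisconnectedException here is a local stand-in for the HBase class of the same name. The assumption is that the IPC layer flips a flag when it notices the socket has closed, and that the scan loop polls that flag between rows:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Local stand-in for the HBase exception of the same name (assumption). */
class CallerDisconnectedException extends IOException {
    CallerDisconnectedException(String message) {
        super(message);
    }
}

/** Hypothetical per-call handle that tracks whether the client is still connected. */
class RpcCall {
    private volatile boolean connectionOpen = true;
    private final long startNanos = System.nanoTime();

    /** Would be invoked by the IPC layer when it sees the socket close. */
    void markDisconnected() {
        connectionOpen = false;
    }

    /** Cheap check that a long-running handler calls between units of work. */
    void throwExceptionIfCallerDisconnected() throws CallerDisconnectedException {
        if (!connectionOpen) {
            long elapsedMs = (System.nanoTime() - startNanos) / 1_000_000;
            throw new CallerDisconnectedException(
                "Aborting call after " + elapsedMs + " ms, since caller disconnected");
        }
    }
}

/** Hypothetical scan loop that polls the call before examining each row. */
class ScanSketch {
    /** Rows examined by the most recent next() call, kept for illustration. */
    int rowsExamined = 0;

    List<String> next(RpcCall call, List<String> rows,
                      Predicate<String> filter, int limit)
            throws CallerDisconnectedException {
        List<String> results = new ArrayList<>();
        rowsExamined = 0;
        for (String row : rows) {
            // Abort before doing more work for a client that has gone away.
            call.throwExceptionIfCallerDisconnected();
            rowsExamined++;
            if (filter.test(row)) {
                results.add(row);
                if (results.size() >= limit) {
                    break;
                }
            }
        }
        return results;
    }
}
```

With a very restrictive filter and a large limit, the loop would otherwise run to the end of the data; the per-row check bounds the wasted work to roughly one row after the disconnect is noticed.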

I spent 2.5 hours trying to write a test case for it, but we have so many 
layers of byzantine caching going on above the IPC sockets that I couldn't 
figure out how to make a client IPC connection actually hard-disconnect. So I 
tested it from the shell. Here's the manual test plan:

1) Create a table with 100 or so rows.
2) Issue the following from the shell:

{code}
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.95-SNAPSHOT, r5c65cc4a19fbc00876a365b10e98142238dc9a97, Wed May  9 13:06:25 PDT 2012

hbase(main):001:0> import org.apache.hadoop.hbase.filter.TestFilter
=> Java::OrgApacheHadoopHbaseFilter::TestFilter
hbase(main):002:0> scan 't1', { FILTER => TestFilter::SlowScanFilter.new(), CACHE => 50 }
ROW                                            COLUMN+CELL
{code}
(shell will hang here)

On the server side, you should see:
{code}

12/05/09 15:03:29 INFO filter.TestFilter: Handler thread Thread[IPC Server handler 0 on 58364,5,main] sleeping in filter...
12/05/09 15:03:30 INFO filter.TestFilter: Handler thread Thread[IPC Server handler 0 on 58364,5,main] sleeping in filter...
12/05/09 15:03:31 INFO filter.TestFilter: Handler thread Thread[IPC Server handler 0 on 58364,5,main] sleeping in filter...
12/05/09 15:03:32 INFO filter.TestFilter: Handler thread Thread[IPC Server handler 0 on 58364,5,main] sleeping in filter...
{code}
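
The filter used in this test is not shown in the comment. A minimal sketch of a filter with the same observable behavior might look like the following. It is not TestFilter.SlowScanFilter from the patch: the class name SlowRejectingFilter, the use of java.util.function.Predicate in place of the HBase Filter interface, and the choice to reject every row are all assumptions made for illustration.

```java
import java.util.function.Predicate;

/**
 * Hypothetical slow filter: sleeps on every row and then rejects it, so a
 * scan that wants 50 rows keeps its handler thread busy without ever
 * filling the batch.
 */
class SlowRejectingFilter implements Predicate<String> {
    private final long sleepMillis;

    /** Number of rows this filter has been asked about. */
    int invocations = 0;

    SlowRejectingFilter(long sleepMillis) {
        this.sleepMillis = sleepMillis;
    }

    @Override
    public boolean test(String row) {
        invocations++;
        System.out.println("Handler thread " + Thread.currentThread()
            + " sleeping in filter...");
        try {
            Thread.sleep(sleepMillis);
        } catch (InterruptedException e) {
            // Restore the interrupt flag so the caller can react to it.
            Thread.currentThread().interrupt();
        }
        return false; // reject every row (assumption)
    }
}
```

With a sleep of about one second per row, this would produce roughly one log line per second, matching the cadence of the server output above.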

Now ^C the shell. You should see on the server:

{code}
12/05/09 15:03:33 ERROR regionserver.RegionServer:
org.apache.hadoop.hbase.ipc.CallerDisconnectedException: Aborting call scan(null, scannerId: 4581116627867187291
numberOfRows: 50
closeScanner: false
), rpc version=1, client version=1, methodsFingerPrint=-944626147 from 127.0.0.1:55648 after 5009 ms, since caller disconnected
        at org.apache.hadoop.hbase.ipc.HBaseServer$Call.throwExceptionIfCallerDisconnected(HBaseServer.java:417)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:3433)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3391)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3415)
        at org.apache.hadoop.hbase.regionserver.RegionServer.scan(RegionServer.java:828)
        at sun.reflect.GeneratedMethodAccessor26.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:358)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1387)
12/05/09 15:03:33 WARN ipc.HBaseServer: IPC Server Responder, call scan(null, scannerId: 4581116627867187291
numberOfRows: 50
closeScanner: false
), rpc version=1, client version=1, methodsFingerPrint=-944626147 from 127.0.0.1:55648: output error
12/05/09 15:03:33 WARN ipc.HBaseServer: IPC Server handler 0 on 58364 caught a ClosedChannelException, this means that the server was processing a request but the client went away. The error message was: null
{code}

We could probably improve the messaging slightly, but this is at least an 
improvement in that the thread doesn't continue to get hung up indefinitely.
                
> Add ability for potentially long-running IPC calls to abort if client 
> disconnects
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-5973
>                 URL: https://issues.apache.org/jira/browse/HBASE-5973
>             Project: HBase
>          Issue Type: Improvement
>          Components: ipc
>    Affects Versions: 0.90.7, 0.92.1, 0.94.0, 0.96.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>         Attachments: hbase-5973.txt
>
>
> We recently had a cluster issue where a user was submitting scanners with a 
> very restrictive filter, and then calling next() with a high scanner caching 
> value. The clients would generally time out the next() call and disconnect, 
> but the IPC call kept running, trying to fill the requested number of rows. Since 
> this was in the context of MR, the tasks making the calls would retry, and 
> the retries would be more likely to time out due to contention with the 
> previous still-running scanner next() call. Eventually, the system spiraled 
> out of control.
> We should add a hook to the IPC system so that RPC calls can check if the 
> client has already disconnected. In such a case, the next() call could abort 
> processing, given any further work is wasted. I imagine coprocessor 
> endpoints, etc, could make good use of this as well.


        
