Properly treating SocketTimeoutException
----------------------------------------

                 Key: HBASE-4462
                 URL: https://issues.apache.org/jira/browse/HBASE-4462
             Project: HBase
          Issue Type: Improvement
    Affects Versions: 0.90.4
            Reporter: Jean-Daniel Cryans
             Fix For: 0.92.0


SocketTimeoutException (STE) is currently treated like any other IOException 
inside HCM.getRegionServerWithRetries, and I think this is a problem. This 
method should only retry in cases where we are pretty sure the operation will 
complete, but with an STE we have already waited (by default) 60 seconds and 
nothing happened.

I found this while debugging Douglas Campbell's problem on the mailing list, 
where it looked like he was using the same scanner from multiple threads. It 
was actually the same client retrying while the first call hadn't even 
finished yet (that's another problem). You could see the first scanner, then 
up to two other handlers waiting for it to finish before they could run 
(because of the synchronization on RegionScanner).

So what should we do? We could treat STE like a DoNotRetryIOException and let 
the client deal with it, or we could retry only once.
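
To make the two options concrete, here is a minimal sketch of a retry loop 
that counts timeouts separately from other IOExceptions. It is not the HCM 
code: the class name, the method and the maxTimeoutRetries parameter are all 
made up for illustration. Setting maxTimeoutRetries to 0 gives the "let the 
client deal with it" behavior, and 1 gives "retry only once".

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of HBase.
public class TimeoutAwareRetry {

    /**
     * Runs op up to maxAttempts times. Plain IOExceptions are retried until
     * the attempts run out; SocketTimeoutExceptions are retried at most
     * maxTimeoutRetries times (assumed knob: 0 = fail fast, 1 = retry once).
     */
    static <T> T callWithRetries(Callable<T> op, int maxAttempts, int maxTimeoutRetries)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        int timeoutsSeen = 0;
        IOException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (SocketTimeoutException ste) {
                // We already waited the full timeout once; the first call may
                // still be running on the server, so do not keep piling on.
                if (timeoutsSeen >= maxTimeoutRetries) {
                    throw ste;
                }
                timeoutsSeen++;
                last = ste;
            } catch (IOException ioe) {
                // Any other IOException keeps the existing retry behavior.
                last = ioe;
            }
        }
        throw last; // every attempt failed
    }
}
```

Since SocketTimeoutException is a subclass of IOException, its catch block has 
to come first, otherwise the generic IOException branch would swallow it.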

There's also the option of having different behavior for get/put/icv/scan. 
The issue with operations that modify a cell is that you don't know whether 
the operation completed or not (the same is true when an RS dies hard after 
completing, say, a Put but just before returning to the client).
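
A per-operation policy could look roughly like the sketch below. The enum, 
its flag and the retry counts are assumptions made up for illustration, not 
existing HBase API or a decided behavior. It only encodes the distinction 
made above: operations that modify a cell are not retried after a timeout, 
because the first attempt may already have been applied.

```java
// Hypothetical policy table, not part of HBase.
public class TimeoutPolicySketch {

    enum Op {
        GET(false),
        SCAN(false),
        PUT(true),
        ICV(true); // incrementColumnValue

        final boolean modifiesCell;

        Op(boolean modifiesCell) {
            this.modifiesCell = modifiesCell;
        }
    }

    /**
     * Number of retries allowed after a SocketTimeoutException.
     * Assumed policy: none for operations that modify a cell, one for reads.
     */
    static int timeoutRetries(Op op) {
        return op.modifiesCell ? 0 : 1;
    }
}
```

Whether a scan should get even one retry is debatable given the handler 
pile-up described earlier; the value 1 here is only a placeholder.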

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
