[
https://issues.apache.org/jira/browse/HBASE-4462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113749#comment-13113749
]
Douglas Campbell commented on HBASE-4462:
-----------------------------------------
+1 on no retry for socket timeout. Retrying after already timing out opens the
possibility of timing out again, which could double the delay with no reward.
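
To make the "no retry on timeout" idea concrete, here is a minimal sketch of a
retry loop that still retries ordinary IOExceptions but rethrows a
SocketTimeoutException on the first occurrence. The class TimeoutAwareRetry
and its call method are hypothetical names for illustration only; this is not
the actual code in HCM.getRegionServerWithRetries.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

// Hypothetical helper for illustration; not part of the HBase client.
final class TimeoutAwareRetry {
    private TimeoutAwareRetry() {}

    /**
     * Runs op up to maxAttempts times. Ordinary IOExceptions are retried;
     * a SocketTimeoutException is rethrown immediately, because the caller
     * has already waited a full timeout and a retry may just wait again.
     */
    static <T> T call(Callable<T> op, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (SocketTimeoutException e) {
                // Must be caught before IOException (it is a subclass).
                throw e;
            } catch (IOException e) {
                // Assumed transient here; remember it and try again.
                last = e;
            }
        }
        // All attempts failed with a retryable IOException.
        throw last;
    }
}
```

With this shape the worst case for a timed-out call is one timeout period
rather than maxAttempts of them, and the client decides what to do next.
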
Another thought is to allow retries but ensure older scan iterators are
aborted, try to notify client threads holding them to abort, and only allow the
newest scan iterator to stay alive.
Without knowing the code, this may be difficult, but it gets you what you
want, i.e. only one iterator over a single object, a configurable number of
retries, and a region server that is not locked up by different threads
contending for a single scanner object.
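
One possible way to get "only the newest iterator stays alive" is a generation
counter: each new attempt takes a ticket, and older attempts notice they have
been superseded the next time they check. The sketch below assumes this
design; ScanLease and its methods are hypothetical names, not HBase classes,
and the abort is cooperative (it does not interrupt a thread that is already
blocked).

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical guard for illustration; not an HBase class.
final class ScanLease {
    // Ticket number of the most recently opened scan attempt.
    private final AtomicLong newest = new AtomicLong();

    /** Registers a new attempt; every earlier ticket becomes stale. */
    long open() {
        return newest.incrementAndGet();
    }

    /** True only for the most recently opened attempt. */
    boolean isCurrent(long ticket) {
        return newest.get() == ticket;
    }

    /** Intended to be called between rows; stale attempts abort here. */
    void checkCurrent(long ticket) {
        if (!isCurrent(ticket)) {
            throw new IllegalStateException(
                "scan attempt " + ticket + " superseded by a newer attempt");
        }
    }
}
```

A retry would call open() before touching the scanner, and each iterator
would call checkCurrent(ticket) before fetching the next batch, so an older
iterator gives up instead of queuing behind the scanner's lock.
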
> Properly treating SocketTimeoutException
> ----------------------------------------
>
> Key: HBASE-4462
> URL: https://issues.apache.org/jira/browse/HBASE-4462
> Project: HBase
> Issue Type: Improvement
> Affects Versions: 0.90.4
> Reporter: Jean-Daniel Cryans
> Fix For: 0.92.0
>
>
> SocketTimeoutException is currently treated like any IOE inside of
> HCM.getRegionServerWithRetries and I think this is a problem. This method
> should only do retries in cases where we are pretty sure the operation will
> complete, but with STE we already waited for (by default) 60 seconds and
> nothing happened.
> I found this while debugging Douglas Campbell's problem on the mailing list
> where it seemed like he was using the same scanner from multiple threads, but
> actually it was just the same client doing retries while the first run
> hadn't even finished yet (that's another problem). You could see the first
> scanner, then up to two other handlers waiting for it to finish in order to
> run (because of the synchronization on RegionScanner).
> So what should we do? We could treat STE as a DoNotRetryException and let the
> client deal with it, or we could retry only once.
> There's also the option of having a different behavior for get/put/icv/scan.
> The issue with operations that modify a cell is that you don't know whether
> the operation completed or not (the same is true when a RS dies hard after
> completing, let's say, a Put but just before returning to the client).