[ https://issues.apache.org/jira/browse/OMID-117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650343#comment-16650343 ]
James Taylor commented on OMID-117:
-----------------------------------

{quote}About the retries, what the worst thing that can happen with this? how bad is it to have it like this?
{quote}
Check out the comment I added to RegionConnectionFactory:
{code:java}
// This setting controls how many retries occur on the region server if an
// IOException occurs while trying to access the commit table. Because a
// handler thread will be in use while these retries occur and the client
// will be blocked waiting, it must not tie up the call for longer than
// the client RPC timeout. Otherwise, the client will initiate retries on its
// end, tying up yet another handler thread. It's best if the retries can be
// zero, as in that case the handler is released and the retries occur on the
// client side. In testing, we've seen NoServerForRegionException occur, which
// is a DoNotRetryIOException and so is not retried on the client. It's not
// clear if this is a real issue or a test-only issue.
private static final int DEFAULT_COMMIT_TABLE_ACCESS_ON_READ_RETRIES_NUMBER = 11;
private static final int DEFAULT_COMMIT_TABLE_ACCESS_ON_READ_RETRY_PAUSE = 100;
{code}
As it is with this patch, if retries are necessary to reach the RS hosting the commit table, they will occur from the RS handling the scan for 48 seconds. During this time, the handler thread will be tied up (i.e. it can't be used by any other HBase client). If this happens to all the handler threads on the RS, then all incoming requests would be queued. For example, non-transactional queries would potentially not be processed during this time. If the retries (and pauses) occur on the client side, then non-transactional workloads wouldn't be impacted.

Ideally, we'd have a test that reproduces this NoServerForRegionException so we can see whether any changes are needed to handle this situation. You might be able to repro this by manually splitting the commit table and then performing a read against a transactional table.
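For anyone wondering where the 48 seconds comes from, here is a rough sketch of the arithmetic. It assumes the retry pauses follow the multiplier table in HBase's HConstants.RETRY_BACKOFF (copied here from memory, so please check it against the HBase version in use), and it ignores jitter and the time spent in the RPC attempts themselves. The class and method names are made up for illustration and are not part of Omid or HBase.

```java
/** Rough estimate of the time a caller spends sleeping between retries. */
final class RetryPauseBudget {
    // Assumption: mirrors HBase's HConstants.RETRY_BACKOFF multipliers.
    // Verify against the HBase release you actually run.
    static final int[] RETRY_BACKOFF = {1, 2, 3, 5, 10, 20, 40, 100, 100, 100, 100, 200, 200};

    /** Sum of the pauses taken across the given number of retries, in milliseconds. */
    static long totalPauseMillis(int retries, long basePauseMillis) {
        long total = 0;
        for (int attempt = 0; attempt < retries; attempt++) {
            // Attempts past the end of the table reuse the last multiplier.
            int idx = Math.min(attempt, RETRY_BACKOFF.length - 1);
            total += basePauseMillis * RETRY_BACKOFF[idx];
        }
        return total;
    }

    public static void main(String[] args) {
        // Defaults from the patch: 11 retries, 100 ms base pause.
        System.out.println(totalPauseMillis(11, 100) + " ms"); // prints "48100 ms"
        // With zero retries the handler thread is released immediately.
        System.out.println(totalPauseMillis(0, 100) + " ms");  // prints "0 ms"
    }
}
```

Under those assumptions, 11 retries at a 100 ms base pause add up to about 48 seconds of a handler thread sitting idle, whereas zero retries pushes all of the waiting to the client side.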
The NoServerForRegionException may also just occur the very first time an RS attempts to reach the commit table after the commit table is created.

> Ensure timeouts are configured low for RPCs to commit table
> -----------------------------------------------------------
>
>                 Key: OMID-117
>                 URL: https://issues.apache.org/jira/browse/OMID-117
>             Project: Apache Omid
>          Issue Type: Bug
>            Reporter: James Taylor
>            Priority: Major
>         Attachments: OMID-117.patch, OMID-117_hbase2.patch, OMID-117_v2.patch, OMID-117_v3.patch, OMID-117_v4.patch, OMID-117_v5.patch, OMID-117_v6.patch, OMID-117_v7.patch
>
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)