[ https://issues.apache.org/jira/browse/HBASE-11288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173529#comment-17173529 ]
Duo Zhang commented on HBASE-11288:
-----------------------------------

Thank you guys, great progress here.

From my experience, we met the same situation when testing TRSP with ITBLL. That is, no data loss, but it is really hard to finish a whole ITBLL run. It will always hit an rpc timeout when putting or verifying.

I think this is easy to understand. We do not change the WAL implementation, and we do not change the actual WAL splitting either (we just change how we schedule WAL splitting), so it is not likely to introduce data loss. Even a double assign is unlikely to cause data loss: it will just lead to a FileNotFoundException when the stale region compacts HFiles, which then fails the requests (this can be fixed by restarting the RS).

What we change here is the critical procedures for dealing with region assignment, and also master startup. When there are corner cases we do not handle correctly, the result is a cluster hang. Sometimes the cluster will hang for a very long time and then recover by itself, and usually you will not sit there watching the ITBLL run, right? So when you come back, you will just see a healthy cluster and an rpc timeout in the ITBLL log. What you can see is that a region could not come online for a very long time but finally did, so you will think this is just because the cluster was overloaded. But actually, if you run ITBLL enough times, you will eventually get a chance to see the cluster in the hanging state, and then find the root cause.

I still remember that we did a lot of work to deal with the corner cases but still could not fully solve the problem, until I noticed that we were heading in the wrong direction and introduced HBASE-22074. It is really not easy to fix all the corner cases.

Thanks.

> Splittable Meta
> ---------------
>
>                 Key: HBASE-11288
>                 URL: https://issues.apache.org/jira/browse/HBASE-11288
>             Project: HBase
>          Issue Type: Umbrella
>          Components: meta
>            Reporter: Francis Christopher Liu
>            Assignee: Francis Christopher Liu
>            Priority: Major
>         Attachments: jstack20200807_bad_rpc_priority.txt, root_priority.patch
>
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)