[ 
https://issues.apache.org/jira/browse/HBASE-11288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173529#comment-17173529
 ] 

Duo Zhang commented on HBASE-11288:
-----------------------------------

Thank you guys, great progress here.

>From my experience, we met the same situation when testing TRSP with ITBLL. 
>That is, no data loss, but it is really hard to finish a whole ITBLL run. It 
>will always hit rpc timeout when putting or verifying.

I think this is easy to understand. We do not change the WAL implementation,
and we do not change the actual WAL splitting either (we only change how we
schedule WAL splitting), so it is unlikely to introduce data loss. Even a
double assign is unlikely to cause data loss; it will just cause a
FileNotFoundException when the stale region compacts HFiles, and then requests
fail (this can be fixed by restarting the RS).

What we change here is the critical procedures for region assignment and for
master startup. When there are corner cases we do not handle correctly, the
result is a cluster hang. Sometimes the cluster hangs for a very long time and
then recovers by itself, and usually you will not sit there watching the ITBLL
run, right? So when you come back, you just see a healthy cluster and an rpc
timeout in the ITBLL log. What you see is that a region could not come online
for a very long time but finally did, so you think this is just because the
cluster was overloaded. But actually, if you run ITBLL enough times, you will
eventually catch the cluster in the hanging state and can then find the root
cause.

I still remember that we did a lot of work to deal with the corner cases
but still could not fully solve the problem until I noticed that we were
heading in the wrong direction and introduced HBASE-22074. It is really not
easy to fix all the corner cases.

Thanks.

> Splittable Meta
> ---------------
>
>                 Key: HBASE-11288
>                 URL: https://issues.apache.org/jira/browse/HBASE-11288
>             Project: HBase
>          Issue Type: Umbrella
>          Components: meta
>            Reporter: Francis Christopher Liu
>            Assignee: Francis Christopher Liu
>            Priority: Major
>         Attachments: jstack20200807_bad_rpc_priority.txt, root_priority.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)
