Hi! In the current Reference Guide, the section "75. Timeline-consistent High Available Reads" is flagged as "maybe broken. Use it with caution". I'm not familiar with the original reason it was flagged, but I have spent a few weeks working on this feature, and after a few small fixes it looks stable enough. I think we should remove this warning for new 2.2+ releases. Below are some details about the fixes and the testing I did.

Fixes:
- HBASE-23589 FlushDescriptor contains non-matching family/output combinations <https://issues.apache.org/jira/browse/HBASE-23589>
- HBASE-23601 OutputSink.WriterThread exception gets stuck and repeated indefinitely <https://issues.apache.org/jira/browse/HBASE-23601>

Testing:
After the fixes I ran IntegrationTestRegionReplicaReplication on a 4-machine cluster (3 RS, 30GB heap/RS). I used the default test parameters and only increased read_delay_ms to 60000. The longest uninterrupted run I tried was 8 hours, and I encountered no issues. Even adding in the chaos monkeys (slowDeterministic) did not reveal any new correctness issues with the feature.

Next steps:
- Further testing. I realize IntegrationTestRegionReplicaReplication provides a very uniform, unrealistic load, so using different data could be interesting. If someone could find the time to run a few tests or propose some scenarios, I would be grateful.
- I was thinking of providing cleaner flush logic on the replication side, but my proposal might have too much overhead, and the current logic, while not without issues, works after the fixes above. The proposal can be found in HBASE-23591; any feedback is welcome.

Thoughts?
