Hi,

Replication works well over a short run, but in a long-running setup its performance seems to degrade on the slave cluster side, to the extent that the slave cluster became unresponsive in one of our testing environments. According to jstack on one node, all of its priority handlers were blocked in the replicateLogEntries method, which was itself blocked because the cluster was in bad shape: 2 of 4 nodes had died, root was unassigned, the node that previously hosted root had become unresponsive, and the only other remaining node had no priority handler left to take care of the root region assignment. The memory footprint of the process also increases (based on `top`; unfortunately, there are no GC logs at the moment).
replicateLogEntries is a high-QOS method; ReplicationSink's overall behavior is to act as a native HBase client and replicate the mutations in its own cluster. This may take some time if, for example, a region is splitting or a target region server hits a GC pause. The call then enters the retry loop, which blocks the priority handler serving that method. Meanwhile, other master cluster region servers are also shipping edits (to this or other region servers), which makes the situation worse.

I wonder whether others have seen this before. Please share.

There is some scope for improvement on the Sink side:

a) ReplicationSink#replicateLogEntries: Make it a normal operation (no high-QOS annotation), and have ReplicationSink periodically check whether the client is still connected. If it is not, throw an exception and bail out. The client will resend the shipment anyway. This frees the handlers from blocking, so the cluster's normal operation is not impeded.

b) Have a thread pool in ReplicationSink and process per-table requests in parallel. This should help with multi-table replication.

c) Free the memory consumed by the shipped array as soon as the mutation list is populated. Currently, if the call to multi is blocked (for any reason), the region server enters the retry logic... and since the entries of the WALEdits array have already been copied into Put/Delete objects, the array can be freed.

Looking forward to more/better suggestions for making replication more stable.

Thanks,
Himanshu
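
P.S. To make suggestions a) through c) a bit more concrete, here is a minimal Java sketch of how they might fit together. It is not HBase code: ReplicationSinkSketch, Entry, TableWriter and the clientConnected callback are all hypothetical stand-ins, and the real sink would issue the HBase client's batch call where TableWriter.write is invoked. Treat it as an illustration of the control flow under those assumptions, not as a patch.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch only; none of these names are real HBase APIs. */
final class ReplicationSinkSketch {

    /** Stand-in for one shipped WAL entry: target table plus an opaque mutation. */
    static final class Entry {
        final String table;
        final String mutation;
        Entry(String table, String mutation) { this.table = table; this.mutation = mutation; }
    }

    /** Stand-in for the blocking, retrying batch write done by the real sink. */
    interface TableWriter {
        void write(String table, List<String> mutations) throws Exception;
    }

    private final ExecutorService pool;
    private final TableWriter writer;

    ReplicationSinkSketch(int threads, TableWriter writer) {
        // Daemon threads so a stuck write never keeps the JVM alive.
        this.pool = Executors.newFixedThreadPool(threads, r -> {
            Thread t = new Thread(r, "sink-sketch");
            t.setDaemon(true);
            return t;
        });
        this.writer = writer;
    }

    /** Applies the shipment; returns the number of mutations written. */
    int replicate(Entry[] shipped, BooleanSupplier clientConnected, long pollMillis)
            throws IOException {
        // (c) Copy entries into per-table lists and drop each array slot right away,
        // so the shipped entries are collectable while the writes are still retrying.
        Map<String, List<String>> byTable = new LinkedHashMap<>();
        for (int i = 0; i < shipped.length; i++) {
            byTable.computeIfAbsent(shipped[i].table, k -> new ArrayList<>())
                   .add(shipped[i].mutation);
            shipped[i] = null;
        }
        // (b) One task per table, run in parallel on the sink's own pool.
        List<Future<Integer>> futures = new ArrayList<>();
        for (Map.Entry<String, List<String>> t : byTable.entrySet()) {
            futures.add(pool.submit(() -> {
                writer.write(t.getKey(), t.getValue());
                return t.getValue().size();
            }));
        }
        int applied = 0;
        try {
            for (Future<Integer> f : futures) {
                while (true) {
                    try {
                        applied += f.get(pollMillis, TimeUnit.MILLISECONDS);
                        break;
                    } catch (TimeoutException stillRunning) {
                        // (a) Bail out if the shipping side has gone away; it will resend.
                        if (!clientConnected.getAsBoolean()) {
                            throw new IOException("client disconnected, aborting shipment");
                        }
                    }
                }
            }
            return applied;
        } catch (ExecutionException e) {
            cancelAll(futures);
            throw new IOException("write failed", e.getCause());
        } catch (InterruptedException e) {
            cancelAll(futures);
            Thread.currentThread().interrupt();
            throw new IOException("interrupted while waiting for writes", e);
        } catch (IOException e) {
            cancelAll(futures);
            throw e;
        }
    }

    private static void cancelAll(List<Future<Integer>> futures) {
        for (Future<Integer> f : futures) {
            f.cancel(true); // interrupts a write that is still blocked
        }
    }

    void shutdown() { pool.shutdownNow(); }
}
```

The handler thread still waits here, but only in short polls, and it gives up as soon as the (assumed) connection check fails instead of staying parked in the retry loop. One caveat with the sketch: tables that were already written before the abort will be written again when the source resends, so this relies on replayed mutations being harmless.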
