[jira] [Updated] (HBASE-15001) Thread Safety issues in ReplicationSinkManager and HBaseInterClusterReplicationEndpoint

Ashu Pachauri (JIRA) Fri, 18 Dec 2015 11:19:06 -0800

     [ https://issues.apache.org/jira/browse/HBASE-15001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Ashu Pachauri updated HBASE-15001:
----------------------------------
    Attachment: Test.java
                repro_stuck_replication.diff

repro_stuck_replication.diff: This is a randomized version of the repro that I 
ran; I do not know how to reproduce the bug deterministically. I let it run in 
a loop for a few hours last night and found the following in the logs this 
morning. Replication is stuck on one of the nodes, and it stays stuck because 
once this happens the sink list is never refreshed:
{code}
2015-12-18 05:08:15,923 WARN  [main-EventThread.replicationSource,testInterClusterReplication.replicationSource.ashu-mbp.dhcp.thefacebook.com%2C53383%2C1450465675871.regiongroup-0,testInterClusterReplication] regionserver.ReplicationSource$ReplicationSourceWorkerThread(1020): org.apache.hadoop.hbase.replication.TestReplicationEndpoint$InterClusterReplicationEndpointForTest threw unknown exception:java.lang.IllegalArgumentException: Illegal Capacity: -1
        at java.util.ArrayList.<init>(ArrayList.java:156)
        at org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.replicate(HBaseInterClusterReplicationEndpoint.java:196)
        at org.apache.hadoop.hbase.replication.TestReplicationEndpoint$InterClusterReplicationEndpointForTest.replicate(TestReplicationEndpoint.java:330)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.shipEdits(ReplicationSource.java:983)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.run(ReplicationSource.java:653)
{code}
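
The trace above is consistent with a negative sink count reaching the 
ArrayList constructor. The following is only a minimal sketch of that path, 
not the HBase source: the batch-count formula is the one quoted in the issue 
description, while the class name, the method names, and the sample values 
(10 threads, 250 entries) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class NegativeCapacityDemo {
    // Same formula as in the issue description, with plain ints standing in for
    // maxThreads, entries.size(), and replicationSinkMgr.getSinks().size().
    static int batchCount(int maxThreads, int entryCount, int sinkCount) {
        return Math.min(Math.min(maxThreads, entryCount / 100 + 1), sinkCount);
    }

    // Returns the exception message if the allocation fails, or null if it succeeds.
    static String tryAllocate(int n) {
        try {
            List<List<String>> batches = new ArrayList<List<String>>(n);
            return null;
        } catch (IllegalArgumentException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // -1 stands in for the corrupted size reported by the unsynchronized list.
        int n = batchCount(10, 250, -1);
        System.out.println("n = " + n + ", allocation error: " + tryAllocate(n));
    }
}
```

Because Math.min propagates the smallest argument, a sink count of -1 becomes 
the batch count, and the list allocation rejects it with the same message 
that appears in the log.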

It should go without saying that operations on a non-thread-safe container 
must be synchronized. Still, to be sure, I ran a multithreaded write test on 
an ArrayList (attached as Test.java) to confirm that it can report a negative 
size. Here is the output after running for a few minutes:
{code}
List not empty, it's size is:  -1
List not empty, it's size is:  -1
{code}
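
The attached Test.java is not reproduced in this message. As a rough 
illustration of the kind of stress test described above, here is a 
hypothetical harness (the class name, method name, and workload are my own 
assumptions) in which several threads add and remove elements on a shared 
list while recording the smallest size() they observe. The main method uses 
Collections.synchronizedList, for which the observed size never drops below 
zero. Passing a plain ArrayList instead gives the unsafe variant, which may 
throw or report inconsistent sizes, although a negative size is not 
guaranteed to show up in any given run.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;

final class ListSizeStress {
    // Starts `threads` workers; each adds and then removes its own id `iterations`
    // times. Returns the smallest size() seen by any worker or after all have finished.
    static int minObservedSize(final List<Integer> list, int threads, final int iterations) {
        final int[] min = new int[threads];
        final CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            min[id] = Integer.MAX_VALUE;
            workers[t] = new Thread(() -> {
                try {
                    start.await(); // release all workers at once to maximize contention
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int i = 0; i < iterations; i++) {
                    list.add(id);
                    list.remove(Integer.valueOf(id)); // remove by value, not by index
                    min[id] = Math.min(min[id], list.size());
                }
            });
            workers[t].start();
        }
        start.countDown();
        int overall = Integer.MAX_VALUE;
        try {
            for (int t = 0; t < threads; t++) {
                workers[t].join(); // join makes each worker's min[] write visible here
                overall = Math.min(overall, min[t]);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return Math.min(overall, list.size());
    }

    public static void main(String[] args) {
        List<Integer> safe = Collections.synchronizedList(new ArrayList<Integer>());
        System.out.println("min size (synchronized): " + minObservedSize(safe, 4, 100000));
    }
}
```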

> Thread Safety issues in ReplicationSinkManager and 
> HBaseInterClusterReplicationEndpoint
> ---------------------------------------------------------------------------------------
>
>                 Key: HBASE-15001
>                 URL: https://issues.apache.org/jira/browse/HBASE-15001
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 2.0.0, 1.2.0, 1.3.0, 1.2.1
>            Reporter: Ashu Pachauri
>            Assignee: Ashu Pachauri
>            Priority: Blocker
>             Fix For: 2.0.0, 1.2.0, 1.3.0
>
>         Attachments: HBASE-15001-V0.patch, Test.java, 
> repro_stuck_replication.diff
>
>
> ReplicationSinkManager is not thread-safe. This can cause problems in 
> HBaseInterClusterReplicationEndpoint when the WAL provider is multiwal. 
> For example: 
> 1. When multiple threads report bad sinks, the sink list can be non-empty but 
> report a negative size, because the underlying ArrayList is not thread-safe. 
> 2. HBaseInterClusterReplicationEndpoint depends on the number of sinks to 
> batch edits for shipping. However, the sink count is read twice in the 
> following code, so the first check can see a non-zero count while the 
> "batching" part sees zero, and the endpoint then assumes that there are no 
> batches to process:
> {code}
> if (replicationSinkMgr.getSinks().size() == 0) {
>     return false;
> }
> ...
> int n = Math.min(Math.min(this.maxThreads, entries.size()/100+1),
>                replicationSinkMgr.getSinks().size());
> {code}
> This is very dangerous: if n is 0, the endpoint (incorrectly) assumes there 
> are no batches to process and reports that replication succeeded, when in 
> fact nothing was replicated. 
> The idea is to make all operations in ReplicationSinkManager thread-safe and 
> to verify the size of the replicated edits before reporting success.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
