[ https://issues.apache.org/jira/browse/HBASE-15001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15066039#comment-15066039 ]
Sean Busbey commented on HBASE-15001:
-------------------------------------
generally, yes. we should be getting a buildbot run in to ensure there aren't
any surprises. it looks like this issue never changed from "Open" status, so
the buildbot never knew to look. mistakes happen, we can clean most things up
if something is broken.
[~tedyu] would you prefer any clean up happen in this ticket (please specify
what) or are you fine with a follow-on?
> Thread Safety issues in ReplicationSinkManager and
> HBaseInterClusterReplicationEndpoint
> ---------------------------------------------------------------------------------------
>
> Key: HBASE-15001
> URL: https://issues.apache.org/jira/browse/HBASE-15001
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Affects Versions: 2.0.0, 1.2.0, 1.3.0, 1.2.1
> Reporter: Ashu Pachauri
> Assignee: Ashu Pachauri
> Priority: Blocker
> Fix For: 2.0.0, 1.2.0, 1.3.0
>
> Attachments: HBASE-15001-V0.patch, Test.java,
> repro_stuck_replication.diff
>
>
> ReplicationSinkManager is not thread-safe. This can cause problems in
> HBaseInterClusterReplicationEndpoint when the WAL provider is multiwal.
> For example:
> 1. When multiple threads report bad sinks, the sink list can be non-empty but
> report a negative size because the ArrayList itself is not thread-safe.
> 2. HBaseInterClusterReplicationEndpoint depends on the number of sinks to
> batch edits for shipping. However, the sink list can shrink between the two
> reads in the following code: the sink size is non-zero at the initial check,
> but by the time we reach the "batching" part it has become zero, so the
> endpoint ends up with no batches to process.
> {code}
> if (replicationSinkMgr.getSinks().size() == 0) {
>   return false;
> }
> ...
> int n = Math.min(Math.min(this.maxThreads, entries.size()/100+1),
>     replicationSinkMgr.getSinks().size());
> {code}
> [Update] With n == 0, this leads to an ArithmeticException (/ by zero) at:
> {code}
> entryLists.get(Math.abs(Bytes.hashCode(e.getKey().getEncodedRegionName())%n)).add(e);
> {code}
> which is benign and will just lead to retries by the ReplicationSource.
> The idea is to make all operations in ReplicationSinkManager thread-safe and
> to verify the number of replicated edits before we report success.
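
For illustration only, here is a minimal sketch of the approach described in the issue. It is not the attached patch. SinkBatchingSketch, SinkRegistry, snapshot() and partition() are hypothetical names, sinks are modelled as plain strings, and java.util.Arrays.hashCode stands in for Bytes.hashCode. It assumes maxThreads is at least 1. The sink count is read once from a locked snapshot, so the zero check and the computation of n cannot disagree.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for the classes in the issue; not HBase code. */
final class SinkBatchingSketch {

  /** Illustrative thread-safe sink holder: every access goes through one lock. */
  static final class SinkRegistry {
    private final Object lock = new Object();
    private final List<String> sinks = new ArrayList<>();

    void setSinks(List<String> newSinks) {
      synchronized (lock) {
        sinks.clear();
        sinks.addAll(newSinks);
      }
    }

    void reportBadSink(String sink) {
      synchronized (lock) {
        sinks.remove(sink);
      }
    }

    /** Returns a copy, so callers never observe concurrent modification. */
    List<String> snapshot() {
      synchronized (lock) {
        return new ArrayList<>(sinks);
      }
    }
  }

  /**
   * Splits region names into batches. Returns an empty list when there are
   * no sinks, which the caller can treat as "retry later".
   */
  static List<List<byte[]>> partition(List<byte[]> regionNames,
      SinkRegistry registry, int maxThreads) {
    // Read the sink count exactly once; both the zero check and the
    // batch-count computation use this same value.
    int sinkCount = registry.snapshot().size();
    if (sinkCount == 0) {
      return Collections.emptyList();
    }
    int n = Math.min(Math.min(maxThreads, regionNames.size() / 100 + 1), sinkCount);

    List<List<byte[]>> batches = new ArrayList<>(n);
    for (int i = 0; i < n; i++) {
      batches.add(new ArrayList<>());
    }
    for (byte[] name : regionNames) {
      // Arrays.hashCode is an assumption standing in for Bytes.hashCode.
      batches.get(Math.abs(Arrays.hashCode(name) % n)).add(name);
    }
    return batches;
  }
}
```

A caller could additionally sum the batch sizes and compare the total with the number of input entries before reporting success, in the spirit of the verification step proposed above.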
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)