[
https://issues.apache.org/jira/browse/HBASE-15001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15064551#comment-15064551
]
Ashu Pachauri commented on HBASE-15001:
---------------------------------------
The point 2 of the description does not reproduce because it is masked by an
ArithmeticException (Divide by zero) here in
HBaseInterClusterReplicationEndpoint:
{code}
entryLists.get(Math.abs(Bytes.hashCode(e.getKey().getEncodedRegionName())%n)).add(e);
{code}
So, I will change this in description. However, I would still like to make a
point that we should keep the following verification check in patch:
{code}
if (numReplicated != entries.size()) {
+ // Something went wrong here and we don't know what, let's just fail
and retry.
+ LOG.warn("The number of edits replicated is different from the
number received,"
+ + " failing for now.");
+ return false;
+ }
{code}
This allows for more robustness in replication loop and makes it easy to catch
errors during replication, especially in a multithreaded world.
> Thread Safety issues in ReplicationSinkManager and
> HBaseInterClusterReplicationEndpoint
> ---------------------------------------------------------------------------------------
>
> Key: HBASE-15001
> URL: https://issues.apache.org/jira/browse/HBASE-15001
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Affects Versions: 2.0.0, 1.2.0, 1.3.0, 1.2.1
> Reporter: Ashu Pachauri
> Assignee: Ashu Pachauri
> Priority: Blocker
> Fix For: 2.0.0, 1.2.0, 1.3.0
>
> Attachments: HBASE-15001-V0.patch, Test.java,
> repro_stuck_replication.diff
>
>
> ReplicationSinkManager is not thread-safe. This can cause problems in
> HBaseInterClusterReplicationEndpoint, when the walprovider is multiwal.
> For example:
> 1. When multiple threads report bad sinks, the sink list can be non-empty but
> report a negative size because the ArrayList itself is not thread-safe.
> 2. HBaseInterClusterReplicationEndpoint depends on the number of sinks to
> batch edits for shipping. However, it's quite possible that the following
> code makes it assume that there are no batches to process (sink size is
> non-zero, but by the time we reach the "batching" part, sink size becomes
> zero.)
> {code}
> if (replicationSinkMgr.getSinks().size() == 0) {
> return false;
> }
> ...
> int n = Math.min(Math.min(this.maxThreads, entries.size()/100+1),
> replicationSinkMgr.getSinks().size());
> {code}
> This is very dangerous, because, (incorrectly) assuming no batches to process
> based on value of n, we can safely report that we replicated successfully,
> while we actually did not replicate anything.
> The idea is to make all operations in ReplicationSinkManager thread-safe and
> do a verification on the size of replicated edits before we report success.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)