[
https://issues.apache.org/jira/browse/HBASE-10249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jean-Daniel Cryans updated HBASE-10249:
---------------------------------------
Attachment: HBASE-10249-0.94-v0.patch
Two things I've noticed that I'm fixing in the attached patch for 0.94:
- The multi path doesn't check whether the znode we're moving is ours, so we
end up deleting our own queue (!!!).
- Looking at the link for the latest failure, we do make that check in the
non-multi path, but doing it takes a few hundred milliseconds. It seems that
those delays all count towards the 10-second limit that we have in order to
clear all the queues. I moved the check of the path before the sleep in
NodeFailoverWorker.run so that we don't waste time on ourselves.
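To illustrate the reordering described in the second point, here is a minimal sketch that checks ownership before sleeping. It is not the code from the patch: the class FailoverSketch, the method claim and its parameters are hypothetical names made up for this example, and the transfer itself is left out.

```java
// Hypothetical stand-in for the failover worker; NOT the actual HBase class.
final class FailoverSketch {

    /**
     * Returns true if the znode should be claimed, false if it is our own.
     * The ownership check comes first, so no sleep is spent on ourselves.
     */
    static boolean claim(String znode, String ownZnode, long sleepMillis) {
        if (znode.equals(ownZnode)) {
            return false; // our own queue: nothing to move, and no sleep
        }
        try {
            Thread.sleep(sleepMillis); // back off before contending for the queue
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
        // The actual transfer (multi or non-multi path) would happen here.
        return true;
    }
}
```

With this ordering, a call on our own znode returns immediately instead of eating into the time budget.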
Regardless, the following wait loop is racy:
{noformat}
int numberOfOldSource = 1; // default wait once
while (numberOfOldSource > 0) {
  Thread.sleep(SLEEP_TIME);
  numberOfOldSource = manager.getOldSources().size();
}
{noformat}
We basically say "let's wait 10 seconds and see if we can transfer _all_ the
queues during that time". If some queues are still being transferred while the
ones we already transferred are done, the in-flight queues don't count as an
oldSource yet, and so we can miss them. The most extreme case is moving 1 queue
with enough znodes that it takes more than 10 seconds to move (I've seen that),
in which case the sync tool will stop even though there might be many more
queues to transfer.
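One way to avoid this kind of race, sketched here only as an illustration and not as what the patch does, is to count completions explicitly instead of polling a collection whose size can be 0 while transfers are still in flight. The sketch assumes the number of queues to transfer is known up front; QueueTransferTracker and its methods are hypothetical names, not part of HBase.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical tracker; NOT part of HBase. Assumes the queue count is known up front.
final class QueueTransferTracker {
    private final CountDownLatch remaining;

    QueueTransferTracker(int queuesToTransfer) {
        this.remaining = new CountDownLatch(queuesToTransfer);
    }

    /** Called once per queue, after it has been transferred and drained. */
    void queueDone() {
        remaining.countDown();
    }

    /** Number of queues that have not reported completion yet. */
    long pending() {
        return remaining.getCount();
    }

    /** Waits until every queue is done or the timeout expires; true if all finished. */
    boolean awaitAll(long timeout, TimeUnit unit) {
        try {
            return remaining.await(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
    }
}
```

Here a queue that is still being moved keeps the count above zero, so the waiter cannot mistake "nothing visible right now" for "everything is done".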
> Intermittent TestReplicationSyncUpTool failure
> ----------------------------------------------
>
> Key: HBASE-10249
> URL: https://issues.apache.org/jira/browse/HBASE-10249
> Project: HBase
> Issue Type: Bug
> Reporter: Lars Hofhansl
> Assignee: Demai Ni
> Fix For: 0.98.0, 0.96.2, 0.99.0, 0.94.17
>
> Attachments: HBASE-10249-0.94-v0.patch, HBASE-10249-trunk-v0.patch
>
>
> New issue to keep track of this.