[
https://issues.apache.org/jira/browse/HBASE-4060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ted Yu updated HBASE-4060:
--------------------------
Description:
>From Eran Kutner:
My concern is that the region allocation process seems to rely too much on
timing considerations and doesn't seem to take enough measures to guarantee
conflicts do not occur. I understand that in a distributed environment, when
you don't get a timely response from a remote machine you can't know for
sure if it did or did not receive the request, however there are things that
can be done to mitigate this and reduce the conflict time significantly. For
example, when I run dbck it knows that some regions are multiply assigned,
the master could do the same and try to resolve the conflict. Another
approach would be to handle late responses, even if the response from the
remote machine arrives after it was assumed to be dead the master should
have enough information to know it had created a conflict by assigning the
region to another server. An even better solution, I think, is for the RS to
periodically test that it is indeed the rightful owner of every region it
holds and relinquish control over the region if it's not.
Obviously a state where two RSs hold the same region is pathological and can
lead to data loss, as demonstrated in my case. The system should be able to
actively protect itself against such a scenario. It probably doesn't need
saying but there is really nothing worse for a data storage system than data
loss.
In my case the problem didn't happen in the initial phase but after
disabling and enabling a table with about 12K regions.
Fix Version/s: (was: 0.90.4)
Summary: Making region assignment more robust (was:
ServerShutdownHandler.FindDaughterVisitor doesn't detect whether daughter
region is assigned to some server)
> Making region assignment more robust
> ------------------------------------
>
> Key: HBASE-4060
> URL: https://issues.apache.org/jira/browse/HBASE-4060
> Project: HBase
> Issue Type: Bug
> Affects Versions: 0.90.3
> Reporter: Ted Yu
> Assignee: Ted Yu
>
> From Eran Kutner:
> My concern is that the region allocation process seems to rely too much on
> timing considerations and doesn't seem to take enough measures to guarantee
> conflicts do not occur. I understand that in a distributed environment, when
> you don't get a timely response from a remote machine you can't know for
> sure if it did or did not receive the request, however there are things that
> can be done to mitigate this and reduce the conflict time significantly. For
> example, when I run dbck it knows that some regions are multiply assigned,
> the master could do the same and try to resolve the conflict. Another
> approach would be to handle late responses, even if the response from the
> remote machine arrives after it was assumed to be dead the master should
> have enough information to know it had created a conflict by assigning the
> region to another server. An even better solution, I think, is for the RS to
> periodically test that it is indeed the rightful owner of every region it
> holds and relinquish control over the region if it's not.
> Obviously a state where two RSs hold the same region is pathological and can
> lead to data loss, as demonstrated in my case. The system should be able to
> actively protect itself against such a scenario. It probably doesn't need
> saying but there is really nothing worse for a data storage system than data
> loss.
> In my case the problem didn't happen in the initial phase but after
> disabling and enabling a table with about 12K regions.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira