[
https://issues.apache.org/jira/browse/HDFS-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868782#action_12868782
]
Todd Lipcon commented on HDFS-1142:
-----------------------------------
Hey Konstantin,
I agree that this shouldn't be marked blocker while discussion is going on.
Let me explain the HBase context in more detail. HBase already uses ZK
to determine regionserver liveness. If a region server dies, it loses its ZK
session, and thus an ephemeral znode disappears. The master notices this,
initiates commitlog recovery for that server, and eventually reassigns the
regions elsewhere. To provide proper database-like semantics, we need to ensure
that once log recovery commences, the regionserver cannot write any more to
that log (otherwise writes might be lost forever).
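The liveness mechanism described above can be sketched with a toy in-memory stand-in (this is not the real ZooKeeper API; `SessionRegistry` and the timeout value are illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy stand-in for ZK ephemeral znodes: an entry is considered live only as
// long as its owner keeps heartbeating within the session timeout.
class SessionRegistry {
    private final long sessionTimeoutMs;
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    SessionRegistry(long sessionTimeoutMs) {
        this.sessionTimeoutMs = sessionTimeoutMs;
    }

    void heartbeat(String server, long nowMs) {
        lastHeartbeat.put(server, nowMs);
    }

    // The master's view: a server whose "ephemeral node" has expired is
    // presumed dead, and log recovery/reassignment begins.
    boolean isAlive(String server, long nowMs) {
        Long last = lastHeartbeat.get(server);
        return last != null && nowMs - last <= sessionTimeoutMs;
    }
}

public class LivenessDemo {
    public static void main(String[] args) {
        SessionRegistry zk = new SessionRegistry(3000); // 3s session timeout
        zk.heartbeat("rs1", 0);
        System.out.println(zk.isAlive("rs1", 1000));    // within timeout: alive
        System.out.println(zk.isAlive("rs1", 5000));    // expired: presumed dead
    }
}
```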
Of course this all works fine if the regionserver has truly died. A big issue
we face, though, is one of long garbage collection pauses (sound familiar?). In
some cases, the pauses can last longer than the ZK session timeout. Thus, the
hbase master decides that the server has died and does log splitting, region
reassignment, etc. Unfortunately, in this scenario, the region server then
comes back to life and flushes a few more writes to the log file, which
summarily get lost forever even though the client thinks they're committed. The
regionserver eventually "notices" that it lost its ZK session and shuts itself
down, but in practice it often has time to get off some last edits before doing
so.
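The race can be seen in a minimal timeline sketch (all names and timings are illustrative): nothing at the storage layer checks liveness, so a writer that outlives its session can still append to a log the master has already finished recovering.

```java
import java.util.ArrayList;
import java.util.List;

public class GcPauseRace {
    static final long SESSION_TIMEOUT_MS = 3000;

    // Master-side check: with no heartbeat since lastHeartbeatMs, has the
    // writer's session expired at time nowMs?
    static boolean sessionExpired(long lastHeartbeatMs, long nowMs) {
        return nowMs - lastHeartbeatMs > SESSION_TIMEOUT_MS;
    }

    public static void main(String[] args) {
        List<String> wal = new ArrayList<>();
        wal.add("edit-1");                    // t=0: normal write, heartbeat sent

        long now = 5000;                      // long GC pause; no heartbeats since t=0
        List<String> recovered = null;
        if (sessionExpired(0, now)) {
            recovered = new ArrayList<>(wal); // master splits the log now
        }

        // Writer wakes up and still appends: nothing in the storage layer
        // is synchronous with the ZK session, so the write succeeds.
        wal.add("edit-2");

        // recovered=1 but written=2: edit-2 was acked to the client yet
        // missed recovery, i.e. it is lost forever.
        System.out.println("recovered=" + recovered.size()
                + " written=" + wal.size());
    }
}
```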
Clearly, using locks in ZK is subject to the same problem: our ZK coordination
is not synchronous with our storage access.
There are two solutions I can think of here:
(a) the "STONITH" technique ( http://en.wikipedia.org/wiki/STONITH ): run the
regionservers in a container service that lets us kill -9 a regionserver when
we think it should be dead. But this is obviously more complicated with regard
to deployment, additional RPCs, etc.
(b) file access revocation: this is what we're trying to do with lease
recovery, and what you're suggesting should not be possible.
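Option (b) amounts to fencing at the storage layer. A minimal sketch, assuming an epoch-style fencing token (the `FencedLog` class and epoch scheme are hypothetical, not actual HDFS internals): recovery bumps an epoch, and any writer still holding the old epoch is rejected.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of file access revocation: the log tracks a lease
// epoch, and appends carrying a stale epoch are refused.
class FencedLog {
    private long currentEpoch = 1;
    private final List<String> edits = new ArrayList<>();

    // Lease recovery revokes the old writer by bumping the epoch; the
    // recovering holder gets the new epoch back.
    long recoverLease() {
        return ++currentEpoch;
    }

    boolean append(String edit, long writerEpoch) {
        if (writerEpoch != currentEpoch) {
            return false; // stale writer fenced out, write refused
        }
        edits.add(edit);
        return true;
    }

    int size() {
        return edits.size();
    }
}

public class FencingDemo {
    public static void main(String[] args) {
        FencedLog log = new FencedLog();
        long oldEpoch = 1;
        log.append("edit-1", oldEpoch);            // succeeds before recovery

        log.recoverLease();                        // master revokes the lease
        boolean ok = log.append("edit-2", oldEpoch); // paused writer comes back

        System.out.println(ok);                    // false: write rejected
        System.out.println(log.size());           // only edit-1 is in the log
    }
}
```

With this in place, the GC-paused regionserver's late flush fails outright instead of silently committing edits that recovery never sees.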
Here's a question: as you described it, the original lease holder and the
recovering lease holder race to recover the lease. If the original holder wins
the recovery, are we guaranteed that no interceding appends have occurred?
E.g., what happens if the recovering process wins, opens the file for append,
and immediately closes it? Are we then guaranteed that another flush() call
from the original client would definitely fail, or can it transparently regain
the lease on the now-closed file?
> Lease recovery doesn't reassign lease when triggered by append()
> ----------------------------------------------------------------
>
> Key: HDFS-1142
> URL: https://issues.apache.org/jira/browse/HDFS-1142
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: name-node
> Affects Versions: 0.21.0
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Attachments: hdfs-1142.txt, hdfs-1142.txt
>
>
> If a soft lease has expired and another writer calls append(), it triggers
> lease recovery but doesn't reassign the lease to a new owner. Therefore, the
> old writer can continue to allocate new blocks, try to steal back the lease,
> etc. This is for the testRecoveryOnBlockBoundary case of HDFS-1139
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.