[
https://issues.apache.org/jira/browse/HDFS-15915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307458#comment-17307458
]
Konstantin Shvachko commented on HDFS-15915:
--------------------------------------------
# Suppose two {{mkdirs}} for the same path are running on the Active NameNode
at the same time. Assume that the path does not exist yet and that the two RPCs
are coming from two clients c1 and c2.
# Then one of them, e.g. c1, will create the directory in memory and generate
the respective transaction {{MkdirOp}}, which has all the fields except for
{{txid}}. Then it will enqueue the transaction in
{{FSEditLogAsync.logEdit(op)}} for further asynchronous processing. The handler
thread processing this RPC from c1 is now free to release the write lock and
give control to other threads.
# {{FSEditLogAsync.run()}} will asynchronously process the transaction when it
dequeues it. At that time it will assign the {{txid}} for the transaction, see
{{logEdit() -> doEditTransaction() -> beginTransaction()}}, and increment the
global transaction count {{FSEditLog.txid}}. This can happen either inside or
outside of the namesystem lock. Under heavy load (a rare event) the call to
{{logEdit()}} can happen outside the lock, and that causes the problem.
# Now suppose that {{MkdirOp}} has not been processed yet, but the second
{{mkdirs()}} from client c2 started executing. It can proceed because the write
lock has been released. The c2 call will find that the directory already exists
and will return to the client without generating any transactions. In the reply
it will populate {{lastSeenStateId}}. But the stateId will be less than the
txId of the {{MkdirOp}} whose effect client c2 has just seen, because this
transaction has not been processed yet and the global tx count
{{FSEditLog.txid}} has not advanced.
# Then of course going to the ObserverNode with that state id can cause a stale
read if the client reaches the Observer before the Observer has tailed the
{{MkdirOp}} edit from the journal.
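
The interleaving in the steps above can be replayed deterministically in a toy model. This is only a sketch under stated assumptions: {{LateTxidSketch}}, {{Op}}, {{replyStateId()}} and {{processOne()}} are hypothetical names, not HDFS classes or methods, and locking, RPC handling and journal syncing are not modelled. The only behavior it reproduces is that the txid is assigned when the op is dequeued.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

/** Toy model of late txid assignment. All names here are hypothetical. */
public class LateTxidSketch {
    /** Stand-in for an edit op such as MkdirOp. */
    public static final class Op {
        public final String path;
        public long txid = -1; // unset until the async thread dequeues the op
        Op(String path) { this.path = path; }
    }

    private final Set<String> dirs = new HashSet<>();     // in-memory namespace
    private final Queue<Op> pending = new ArrayDeque<>(); // async edit queue
    private long lastTxid = 0;                            // stands in for FSEditLog.txid

    /** Handler side: update memory, enqueue the op WITHOUT a txid. */
    public boolean mkdirs(String path) {
        if (!dirs.add(path)) {
            return false; // directory already exists, no transaction generated
        }
        pending.add(new Op(path));
        return true;
    }

    /** State id that would be put into an RPC reply right now. */
    public long replyStateId() {
        return lastTxid;
    }

    /** Async side: the txid is assigned only when the op is dequeued. */
    public Op processOne() {
        Op op = pending.remove();
        op.txid = ++lastTxid;
        return op;
    }

    public static void main(String[] args) {
        LateTxidSketch nn = new LateTxidSketch();
        nn.mkdirs("/a");                    // c1 creates the directory, op is queued
        nn.mkdirs("/a");                    // c2 finds the directory already there
        long c2StateId = nn.replyStateId(); // c2's reply is built before the op is processed
        Op op = nn.processOne();            // async thread catches up later
        // prints: c2 lastSeenStateId=0, MkdirOp txid=1
        System.out.println("c2 lastSeenStateId=" + c2StateId + ", MkdirOp txid=" + op.txid);
    }
}
```

In this model c2 leaves with state id 0 although the directory it observed corresponds to txid 1, which is the gap an Observer read can fall into.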
I managed to reproduce this in a unit test. Attaching. The test spawns a bunch
of {{mkdirs()}} on the same path. Then it mocks {{doEditTransaction()}} to
delay async processing of the mkdir transaction on Active NN. The delay is
sufficient for another {{mkdirs()}} call to pass through and obtain the wrong
{{lastSeenStateId}}. Then one can see {{FileNotFoundException}}, which
indicates a stale read from the Observer.
_It seems that a straightforward solution is to assign the transaction id at
the time of the transaction's creation, before it is enqueued. The queue order
should guarantee the same assignment as now, while avoiding the race._
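
A minimal sketch of that idea in the same kind of toy model follows. {{EarlyTxidSketch}}, {{Op}}, {{replyStateId()}} and {{processOne()}} are again hypothetical names, not the actual patch, and the model assumes the enqueue happens under the namesystem write lock and ignores journal syncing. The txid is taken at creation time, so a later caller that sees the directory also sees the advanced counter, and FIFO dequeue order still matches txid order.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

/** Toy model of early txid assignment. All names here are hypothetical. */
public class EarlyTxidSketch {
    /** Stand-in for an edit op; the txid is fixed at construction. */
    public static final class Op {
        public final String path;
        public final long txid;
        Op(String path, long txid) { this.path = path; this.txid = txid; }
    }

    private final Set<String> dirs = new HashSet<>();     // in-memory namespace
    private final Queue<Op> pending = new ArrayDeque<>(); // async edit queue
    private long lastTxid = 0;                            // global transaction counter

    /** Handler side: assign the txid BEFORE enqueueing (assumed to run under the write lock). */
    public boolean mkdirs(String path) {
        if (!dirs.add(path)) {
            return false; // directory already exists, no transaction generated
        }
        pending.add(new Op(path, ++lastTxid));
        return true;
    }

    /** State id that would be put into an RPC reply right now. */
    public long replyStateId() {
        return lastTxid;
    }

    /** Async side: only dequeues; the txid is already set. */
    public Op processOne() {
        return pending.remove();
    }

    public static void main(String[] args) {
        EarlyTxidSketch nn = new EarlyTxidSketch();
        nn.mkdirs("/a");                    // c1 creates /a, op gets txid 1 immediately
        nn.mkdirs("/a");                    // c2 finds /a already there
        long c2StateId = nn.replyStateId(); // counter has already advanced
        nn.mkdirs("/b");                    // an unrelated op gets txid 2
        Op first = nn.processOne();
        Op second = nn.processOne();
        // prints: c2 lastSeenStateId=1, dequeued /a@1 then /b@2
        System.out.println("c2 lastSeenStateId=" + c2StateId + ", dequeued "
            + first.path + "@" + first.txid + " then " + second.path + "@" + second.txid);
    }
}
```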
> Race condition with async edits logging due to updating txId outside of the
> namesystem log
> ------------------------------------------------------------------------------------------
>
> Key: HDFS-15915
> URL: https://issues.apache.org/jira/browse/HDFS-15915
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs, namenode
> Reporter: Konstantin Shvachko
> Priority: Major
>
> {{FSEditLogAsync}} creates an {{FSEditLogOp}} and populates its fields inside
> {{FSNamesystem.writeLock}}. But one essential field, the transaction id of the
> edits op, remains unset until the time when the operation is scheduled for
> syncing. At that time {{beginTransaction()}} will set the
> {{FSEditLogOp.txid}} and increment the global transaction count. On a busy
> NameNode this event can fall outside the write lock.
> This causes problems for Observer reads. It can also potentially reorder
> transactions, so that the Standby applies them in the wrong order.