[ 
https://issues.apache.org/jira/browse/HDFS-15915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307458#comment-17307458
 ] 

Konstantin Shvachko commented on HDFS-15915:
--------------------------------------------

# Suppose two {{mkdirs}} for the same path are running on the Active NameNode 
at the same time. Assume that the path does not exist yet and that the two RPCs 
are coming from two clients c1 and c2.
# Then one of them, e.g. c1, will create the directory in memory and generate 
the respective transaction {{MkdirOp}}, which has all the fields except for 
{{txid}}. Then it will enqueue the transaction in 
{{FSEditLogAsync.logEdit(op)}} for further asynchronous processing. The handler 
thread processing this RPC from c1 is now free to release the write lock and 
give control to other threads.
# {{FSEditLogAsync.run()}} will asynchronously process the transaction when it 
dequeues it. At that time it will assign the {{txid}} for the transaction, see 
{{logEdit() -> doEditTransaction() -> beginTransaction()}}, and increment the 
global transaction count {{FSEditLog.txid}}. This can happen either inside or 
outside of the namesystem lock. Under heavy load (a rare event) the call to 
{{logEdit()}} can happen outside the lock, and that causes the problem.
# Now suppose that {{MkdirOp}} has not been processed yet, but the second 
{{mkdirs()}} from client c2 started executing. It can proceed because the write 
lock has been released. The c2 call will find that the directory already exists 
and will return to the client without generating any transactions. In the reply 
it will populate {{lastSeenStateId}}. But the stateId will be less than the 
txId of the {{MkdirOp}} whose effect client c2 has just seen, because this 
transaction has not been processed yet and the global tx count 
{{FSEditLog.txid}} has not advanced.
# Then of course going to an ObserverNode with that transaction id can cause a 
stale read if the client reaches the Observer before the Observer tails the 
{{MkdirOp}} edit from the journal.
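
The sequence above can be replayed in a toy, single-threaded model. This is only a sketch under stated assumptions: {{LateTxidRace}}, {{Op}}, {{mkdirs()}} and {{processOne()}} are hypothetical stand-ins written for illustration, not the real {{FSNamesystem}} or {{FSEditLogAsync}} code, and the "delay" of the async thread is modelled simply by calling {{processOne()}} last.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

/** Toy model of the race. None of these classes are the real HDFS ones. */
public class LateTxidRace {
    /** Hypothetical stand-in for an edit op whose txid is unset when enqueued. */
    static final class Op {
        final String path;
        long txid = -1; // stays unset until the async thread dequeues the op
        Op(String path) { this.path = path; }
    }

    private final Set<String> namespace = new HashSet<>(); // in-memory directories
    private final Queue<Op> pending = new ArrayDeque<>();  // async edit queue
    private long lastTxid = 0;                             // global transaction count

    /** Handler thread work; returns the state id that would go into the reply. */
    long mkdirs(String path) {
        if (namespace.add(path)) {
            pending.add(new Op(path)); // enqueued without a txid
        }
        return lastTxid;
    }

    /** Async thread work: the txid is assigned only at dequeue time. */
    Op processOne() {
        Op op = pending.remove();
        op.txid = ++lastTxid;
        return op;
    }

    public static void main(String[] args) {
        LateTxidRace nn = new LateTxidRace();
        nn.mkdirs("/dir");                  // c1 creates the directory
        long c2StateId = nn.mkdirs("/dir"); // c2 finds it before the op is processed
        Op op = nn.processOne();            // delayed async processing
        System.out.println("c2 state id = " + c2StateId + ", MkdirOp txid = " + op.txid);
        System.out.println("stale read possible: " + (c2StateId < op.txid));
    }
}
```

In this model c2 gets a state id that is lower than the txid later given to the op that created the directory, which is the condition under which an Observer that has not yet applied that op could serve a stale read.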

I managed to reproduce this in a unit test. Attaching. The test spawns a bunch 
of {{mkdirs()}} on the same path. Then it mocks {{doEditTransaction()}} to 
delay async processing of the mkdir transaction on Active NN. The delay is 
sufficient for another {{mkdirs()}} call to pass through and obtain the wrong 
{{lastSeenStateId}}. Then one can see {{FileNotFoundException}}, which 
indicates a stale read from the Observer.

_It seems that a straightforward solution is to assign the transaction id at 
the time of the op's creation, before it is enqueued. The queue order should 
guarantee the same result of the assignment as now, but will avoid the race._
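
A minimal sketch of that idea, again in a toy model rather than the actual patch: {{EarlyTxidSketch}} and its members are hypothetical names, and {{synchronized}} is used only as a stand-in for the namesystem write lock. The txid is taken from the global counter when the op is created, so the counter has already advanced by the time any later handler builds its reply.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

/** Toy model of the proposed fix. Hypothetical names, not the real HDFS classes. */
public class EarlyTxidSketch {
    /** Hypothetical edit op that carries its txid from the moment it is created. */
    static final class Op {
        final String path;
        final long txid;
        Op(String path, long txid) { this.path = path; this.txid = txid; }
    }

    private final Set<String> namespace = new HashSet<>(); // in-memory directories
    private final Queue<Op> pending = new ArrayDeque<>();  // async edit queue (FIFO)
    private long lastTxid = 0;                             // global transaction count

    /** Handler thread work; synchronized stands in for the write lock. */
    synchronized long mkdirs(String path) {
        if (namespace.add(path)) {
            pending.add(new Op(path, ++lastTxid)); // txid assigned before enqueueing
        }
        return lastTxid; // already covers every op created so far
    }

    /** Async thread work: only dequeues, the txid is already set. */
    synchronized Op processOne() {
        return pending.remove();
    }

    public static void main(String[] args) {
        EarlyTxidSketch nn = new EarlyTxidSketch();
        nn.mkdirs("/dir");                  // c1 creates the directory
        long c2StateId = nn.mkdirs("/dir"); // c2 finds it before the op is processed
        Op op = nn.processOne();            // async processing can be arbitrarily late
        System.out.println("c2 state id = " + c2StateId + ", MkdirOp txid = " + op.txid);
        System.out.println("stale read possible: " + (c2StateId < op.txid));
    }
}
```

Because the queue is FIFO and ids are handed out in enqueue order, ops in this model are still dequeued in increasing txid order, which matches the expectation that the assignment result stays the same as now.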

> Race condition with async edits logging due to updating txId outside of the 
> namesystem log
> ------------------------------------------------------------------------------------------
>
>                 Key: HDFS-15915
>                 URL: https://issues.apache.org/jira/browse/HDFS-15915
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs, namenode
>            Reporter: Konstantin Shvachko
>            Priority: Major
>
> {{FSEditLogAsync}} creates an {{FSEditLogOp}} and populates its fields inside 
> {{FSNamesystem.writeLock}}. But one essential field, the transaction id of the 
> edits op, remains unset until the time when the operation is scheduled for 
> syncing. At that time {{beginTransaction()}} will set the 
> {{FSEditLogOp.txid}} and increment the global transaction count. On a busy 
> NameNode this event can fall outside the write lock. 
> This causes problems for Observer reads. It can also potentially reshuffle 
> transactions, so that the Standby applies them in the wrong order.


