On 17/8/9 23:24, ge changwei wrote:
> On 2017/8/9 7:32 PM, Joseph Qi wrote:
>> On 17/8/7 15:13, Changwei Ge wrote:
>>> In the current code, while flushing ASTs, we don't handle the case where
>>> sending an AST or BAST fails.
>>> But it is indeed possible for an AST or BAST to be lost due to some kind of
>>> network fault.
>> Could you please describe this issue more clearly? It is better to analyze
>> the issue along with the error messages and the status of the related nodes.
>> IMO, if the network is down, one of the two nodes will be fenced. So what's
>> your case here?
> I have posted the status of the related lock resource in my preceding email.
> Please check it out.
> Moreover, the network is not down forever, nor even for longer than the
> fencing threshold.
> So no node will be fenced.
> This issue happens in a very poor network environment. Some messages may be
> dropped by the switch under various conditions.
> Frequent, rapid link up/down events will also cause this issue.
> In a nutshell, re-queuing the AST and BAST is crucial when the link between
> nodes recovers quickly. It prevents the cluster from hanging.
So you mean the TCP packet is lost due to a connection reset? IIRC,
Junxiao has posted a patchset to fix this issue.
If you are using the re-queuing approach, how do you make sure the original
message is *truly* lost and the same AST/BAST won't be sent twice?
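
To make the concern concrete, here is a minimal sketch (Python, not kernel C) of one way a resend could be made harmless: tag each AST/BAST with a per-lock sequence number and let the receiver ignore anything it has already applied. `AstReceiver`, `deliver` and `seq` are hypothetical names for illustration only; nothing here is taken from the ocfs2 DLM or from the patch under discussion.

```python
class AstReceiver:
    """Hypothetical receiver that applies each (lock, seq) at most once."""

    def __init__(self):
        self.last_seq = {}   # lock name -> highest sequence number applied
        self.applied = []    # record of messages actually acted on

    def deliver(self, lock, seq, kind):
        # A resend carries the same seq as the original, so it is skipped.
        if seq <= self.last_seq.get(lock, -1):
            return False
        self.last_seq[lock] = seq
        self.applied.append((lock, seq, kind))
        return True


receiver = AstReceiver()
first = receiver.deliver("lockres-A", 0, "AST")
resend = receiver.deliver("lockres-A", 0, "AST")   # duplicate after a re-queue
nxt = receiver.deliver("lockres-A", 1, "BAST")
```

With something like this on the receiving side, the sender would not need to know whether the original was truly lost; without it, a blind resend could deliver the same AST/BAST twice.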
>>> If the above exception happens, the requesting node will never get an AST
>>> back; hence, it will never acquire the lock or abort the current locking.
>>> With this patch, I'd like to fix this issue by re-queuing the AST or
>>> BAST if sending fails due to a network fault.
>>> And the re-queuing AST or BAST will be dropped if the requesting node is
>>> It will improve the reliability a lot.
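
For readers following along, the re-queuing idea described in the quoted text can be sketched roughly as below. This is Python rather than kernel C, and `flush_asts`, `flaky_send` and `is_alive` are hypothetical names, not DLM functions. The drop condition is truncated in the quoted text; the sketch assumes it means the target node is no longer alive.

```python
from collections import deque


def flush_asts(pending, send, is_alive):
    """Try each queued message once; return the ones to retry later."""
    retry = deque()
    while pending:
        node, msg = pending.popleft()
        if not is_alive(node):
            continue                    # assumed drop condition: node is gone
        try:
            send(node, msg)
        except ConnectionError:
            retry.append((node, msg))   # send failed: keep it for the next flush
    return retry


# Hypothetical transport: the first send to node 2 fails, later ones succeed.
failures = {2: 1}
sent = []


def flaky_send(node, msg):
    if failures.get(node, 0) > 0:
        failures[node] -= 1
        raise ConnectionError("link down")
    sent.append((node, msg))


alive = {1, 2}
queue = deque([(1, "AST"), (2, "BAST"), (3, "AST")])
queue = flush_asts(queue, flaky_send, lambda n: n in alive)  # node 3 is dropped
after_first = list(queue)
queue = flush_asts(queue, flaky_send, lambda n: n in alive)  # retry succeeds
```

Note that the sketch only re-queues when the send itself reports an error; it says nothing about a message that was sent successfully but lost further along the path.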
>> Ocfs2-devel mailing list