Hello Ben,

Thank you for your reply.

Our farm has actually been experiencing this issue for some time, and we have 
spent a lot of time trying to figure it out. We found that the issue is more 
serious when heavy I/O throughput consumes the network bandwidth and many 
network packets are lost. After we configured a separate network interface for 
the client machines in the NetInfo file, the symptoms improved, but the issue 
still exists.

Still, we all think this case is not handled well. The client should report 
"timeout" and exit rather than stay blocked.
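
To correlate blocked clients with the server-side messages, the failed probes 
in the fileserver log can be tallied per client. The following is only a 
minimal sketch: the regular expression is modelled solely on the two log lines 
quoted further down in this thread, the function name count_probe_failures is 
hypothetical, and other OpenAFS versions may word the message differently.

```python
import re
from collections import Counter

# Assumption: pattern modelled on the two log lines quoted in this thread.
# \s+ also tolerates lines that a mail client has wrapped.
PROBE_FAILURE = re.compile(
    r"CheckHost_r: Probing all interfaces of host\s+"
    r"(\d{1,3}(?:\.\d{1,3}){3}):(\d+)\s+failed, code\s+(-?\d+)"
)


def count_probe_failures(log_text):
    """Return a Counter keyed by (client address, error code)."""
    counts = Counter()
    for host, _port, code in PROBE_FAILURE.findall(log_text):
        counts[(host, int(code))] += 1
    return counts


if __name__ == "__main__":
    # Sample input copied from the messages quoted in this thread.
    sample = (
        "Mon Apr 27 08:00:34 2020 CheckHost_r: Probing all interfaces of host "
        "192.168.63.194:7001 failed, code -3\n"
        "Mon Apr 27 08:07:37 2020 CheckHost_r: Probing all interfaces of host "
        "192.168.63.219:7001 failed, code -3\n"
    )
    for (host, code), n in sorted(count_probe_failures(sample).items()):
        print(host, code, n)
```

Feeding it the whole log as one string, rather than line by line, is what lets 
it cope with wrapped lines.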

The OpenAFS versions we use are listed below:

Server side: OpenAFS 1.6.11

Client side: OpenAFS 1.6.23

Any comments or suggestions would be greatly appreciated.


Best wishes,
Qiulan



huangql
====================================================================
Computing center,the Institute of High Energy Physics, CAS, China
Qiulan Huang                       Tel: (+86) 10 8823 6087
P.O. Box 918-7                       Fax: (+86) 10 8823 6839
Beijing 100049  P.R. China           Email: huan...@ihep.ac.cn
===================================================================
 
From: Benjamin Kaduk
Date: 2020-04-28 06:29
To: huangql
CC: openafs-info; huqb
Subject: Re: [OpenAFS] Clients are blocked with error code -3 of 
RXAFSCB_ProbeUuid
On Mon, Apr 27, 2020 at 09:16:14AM +0800, huangql wrote:
> Hello All,
> 
> 
> We found that some clients are blocked. No operations under /afs, such as 
> "cd" or "ls", are available any more; all of them block.
> 
> We can see some log messages on the server side showing error code -3:
> 
> 
> Mon Apr 27 08:00:34 2020 CheckHost_r: Probing all interfaces of host 
> 192.168.63.194:7001 failed, code -3
> Mon Apr 27 08:07:37 2020 CheckHost_r: Probing all interfaces of host 
> 192.168.63.219:7001 failed, code -3
> 
> Restarting the AFS service failed to restore /afs; only restarting the 
> client nodes did.
> 
> Has anyone seen similar cases? Any suggestions would be appreciated. 
> Thanks.
 
That's an interesting error code to be seeing;
https://www.central.org/pages/numbers/errors.html shows -3 as
RX_CALL_TIMEOUT, which does not seem to match your description of the
issue.  A brief glance at the code indicates that we can also generate this
error locally if our clock is moving backwards a lot.
 
I don't expect the above to be helpful, and don't recall any similar cases,
but figured it is better to reply with what little I know than to leave
your message with no reply.
 
-Ben
_______________________________________________
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info
