On 11/08/2016 16:10, Colin Faber wrote:
First glance indicates you're having network connectivity problems,
(possibly driver issue with your NIC?)

I don't seem to have had any problems with any other services running on the cluster, and there are no messages in the journal or the /var/log files relating to network errors.

Oddly though, when the /home filesystem hangs, the /storage and /scratch filesystems, which are served by the same Lustre servers, continue to respond without problems.

What does seem to have some bearing on it is that the first few writes seem to succeed and then it hangs. Though it was first noticed through Samba, it also appears to happen when logged in to the console directly.

(Check MTU settings, etc?)

Pasting as quotation as it stops thunderbird from wrapping the text.....

root@test-r710:~# ifconfig
eno1      Link encap:Ethernet  HWaddr 00:26:b9:84:c7:8d
          inet addr:192.168.1.80  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::226:b9ff:fe84:c78d/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:8516 errors:0 dropped:0 overruns:0 frame:0
          TX packets:23199 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:5297958 (5.2 MB)  TX bytes:3222616 (3.2 MB)

eno2      Link encap:Ethernet  HWaddr 00:26:b9:84:c7:8f
          inet addr:192.168.0.80  Bcast:192.168.0.255  Mask:255.255.255.0
          inet6 addr: fe80::226:b9ff:fe84:c78f/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1374513 errors:0 dropped:0 overruns:0 frame:0
          TX packets:168485 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2026863011 (2.0 GB)  TX bytes:21861558 (21.8 MB)

eno4      Link encap:Ethernet  HWaddr 00:26:b9:84:c7:93
          inet addr:137.205.232.159  Bcast:137.205.232.255  Mask:255.255.255.128
          inet6 addr: fe80::226:b9ff:fe84:c793/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:11483 errors:0 dropped:0 overruns:0 frame:0
          TX packets:10560 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:3504764 (3.5 MB)  TX bytes:5731764 (5.7 MB)


root@test-r710:~# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         137.205.232.254 0.0.0.0         UG    0      0        0 eno4
137.205.232.128 0.0.0.0         255.255.255.128 U     0      0        0 eno4
192.168.0.0     0.0.0.0         255.255.255.0   U     0      0        0 eno2
192.168.1.0     0.0.0.0         255.255.255.0   U     0      0        0 eno1
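
As a rough sketch only (not something that was run on this cluster), the MTU and error counters in the ifconfig output above could be checked programmatically along these lines. The function name summarize_ifconfig and the trimmed SAMPLE text are illustrative assumptions, and the parser only handles the older net-tools output format pasted here.

```python
import re


def summarize_ifconfig(text):
    """Return {iface: {"mtu": ..., "rx_errors": ..., ...}} from net-tools style output."""
    summary = {}
    current = None
    for line in text.splitlines():
        # A new interface stanza starts in column 0, e.g. "eno2      Link encap:Ethernet".
        header = re.match(r"^(\S+)\s+Link encap:", line)
        if header:
            current = header.group(1)
            summary[current] = {}
            continue
        if current is None:
            continue
        mtu = re.search(r"MTU:(\d+)", line)
        if mtu:
            summary[current]["mtu"] = int(mtu.group(1))
        # RX and TX lines share the same leading "packets / errors / dropped" layout.
        counters = re.match(r"\s+(RX|TX) packets:\d+ errors:(\d+) dropped:(\d+)", line)
        if counters:
            direction = counters.group(1).lower()
            summary[current][direction + "_errors"] = int(counters.group(2))
            summary[current][direction + "_dropped"] = int(counters.group(3))
    return summary


# Trimmed copy of the eno2 stanza (the interface on the Lustre network).
SAMPLE = """\
eno2      Link encap:Ethernet  HWaddr 00:26:b9:84:c7:8f
          inet addr:192.168.0.80  Bcast:192.168.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1374513 errors:0 dropped:0 overruns:0 frame:0
          TX packets:168485 errors:0 dropped:0 overruns:0 carrier:0
"""

if __name__ == "__main__":
    for name, stats in summarize_ifconfig(SAMPLE).items():
        print(name, stats)
```

Clean counters and a matching MTU on the client side do not rule out a problem elsewhere on the path, so this only confirms what the pasted output already suggests.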

Lustre mounts in fstab :
# Lustre mounted
192.168.0.4@tcp0:/storage       /storage        lustre  defaults,_netdev,flock 0 0
192.168.0.4@tcp0:/home          /home           lustre  defaults,_netdev,flock 0 0
192.168.0.4@tcp0:/scratch       /scratch        lustre  defaults,_netdev,flock 0 0

I've also tried compiling the latest source and installing those modules (Lustre: Build Version: 2.8.56_26_g6fad3ab). This build does not seem to have the problem with MATLAB (mentioned about a month or so ago), but it still has the hanging problem.

The Lustre startup logs in the journal are here :
Aug 12 12:57:10 test-r710 kernel: Lustre: Lustre: Build Version: 2.8.56_26_g6fad3ab
Aug 12 12:57:10 test-r710 kernel: Lustre: Server MGS version (2.1.0.0) is much older than client. Consider upgrading server (2.8.56_26_g6fad3ab)
Aug 12 12:57:10 test-r710 kernel: Lustre: Trying to mount a client with IR setting not compatible with current mgc. Force to use current mgc setting that is IR disabled.
Aug 12 12:57:10 test-r710 kernel: Lustre: Mounted home-client


Cheers.

Phill.






_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
