Recently we upgraded all of our Lustre clients from 1.4.11.1 to 1.6.3
while keeping all of the MDS/OSSes on 1.4.11.1 (they will be upgraded
later on). This has resulted in a previously unseen error when running a
specific application (ccsm3) on this Lustre filesystem: opening the same
file from a handful of compute nodes causes some or all of the
corresponding Lustre clients to be evicted and the open system call to
fail.
Once triggered, any process on any node that tries to stat the
directory where this file is opened is also evicted and freezes for
100 s. (Output from /var/log/messages is included at the end of this
mail; apologies for the length...)
I would like to know whether this problem is triggered only when using
1.6.3 clients against 1.4.11.1 servers, or whether it can also occur with
more closely matched client/server versions. Is there any way to prevent
it without upgrading the servers or downgrading the clients? (I will
later try to reproduce this using 1.4.11.1 clients.)
Our setup:
1 MDS + 3 OSS: Centos4 with 2.6.9-55.0.9.EL_lustre.1.4.11.1smp
clients: Centos4 with 2.6.9-55.0.9.EL_lustre.1.6.3smp
(Lustre over Gigabit ethernet, no failover)
We have narrowed this down to two similar examples, both of which
trigger the error:
one in Fortran:
      implicit none
      integer i
      do i = 1, 1000
         open(unit=16, file='tmpfile')
         close(16)
      enddo
      stop
      end
and one in C:
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
int main(void) {
    int i, fd;
    char cwd[4096];
    char tmpfile[4196];

    for (i = 0; i < 1000; i++) {
        if (getcwd(cwd, 4096) == NULL)
            exit(-1);
        sprintf(tmpfile, "%s/tmpfile", cwd);
        fd = open(tmpfile, O_RDWR | O_CREAT, 0666);
        close(fd);
    }
    return 0;
}
We can trigger the error by executing one of these examples on at least
3 nodes in parallel, i.e.:
1. Copy the binary to its own directory on the Lustre filesystem.
2. Execute the binary in parallel on at least 3 nodes.
Example:
gcc evict.c -o evict
mkdir /lustrefs/tmpdir
cp evict /lustrefs/tmpdir
pdsh -f3 -w n[1-3] 'cd /lustrefs/tmpdir; ./evict'
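The same metadata pattern (a shared file repeatedly created/truncated and closed) can also be driven from a plain POSIX shell, which may be handy for testing without compiling anything. This is an untested sketch: DIR defaults to the current directory so the snippet runs anywhere, but on the cluster it would be /lustrefs/tmpdir, run concurrently on at least 3 nodes as above:

```shell
# DIR defaults to the current directory; on the cluster set DIR=/lustrefs/tmpdir
DIR="${DIR:-.}"
cd "$DIR" || exit 1
i=0
while [ "$i" -lt 1000 ]; do
    : > tmpfile          # open(O_WRONLY|O_CREAT|O_TRUNC) followed by close
    i=$((i + 1))
done
```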
/var/log/messages on evicted client:
Dec 14 16:31:26 n75 kernel: LustreError: 11-0: an error occurred while
communicating with [EMAIL PROTECTED] The ldlm_enqueue operation failed with -107
Dec 14 16:31:26 n75 kernel: Lustre:
MDC_mds2_misu2_mds_MNT_misu2_client-000001007e011400: Connection to service
misu2_mds via nid [EMAIL PROTECTED] was lost; in progress operations us
ing this service will wait for recovery to complete.
Dec 14 16:31:26 n75 kernel: LustreError: 167-0: This client was evicted by
misu2_mds; in progress operations using this service will fail.
Dec 14 16:31:26 n75 kernel: LustreError:
5484:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -5
Dec 14 16:31:26 n75 kernel: LustreError:
5484:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL
PROTECTED] x854/t0 o101->[EMAIL PROTECTED]:12 lens 392/728 re
f 1 fl Rpc:/0/0 rc 0/0
Dec 14 16:31:26 n75 kernel: Lustre:
MDC_mds2_misu2_mds_MNT_misu2_client-000001007e011400: Connection restored to
service misu2_mds using nid [EMAIL PROTECTED]
Dec 14 16:33:06 n75 kernel: LustreError: 11-0: an error occurred while
communicating with [EMAIL PROTECTED] The ldlm_enqueue operation failed with -107
Dec 14 16:33:06 n75 kernel: Lustre:
MDC_mds2_misu2_mds_MNT_misu2_client-000001007e011400: Connection to service
misu2_mds via nid [EMAIL PROTECTED] was lost; in progress operations us
ing this service will wait for recovery to complete.
Dec 14 16:33:06 n75 kernel: LustreError: 167-0: This client was evicted by
misu2_mds; in progress operations using this service will fail.
Dec 14 16:33:06 n75 kernel: LustreError:
5484:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -5
Dec 14 16:33:06 n75 kernel: LustreError:
5484:0:(mdc_locks.c:423:mdc_finish_enqueue()) Skipped 1 previous similar message
Dec 14 16:33:06 n75 kernel: Lustre:
MDC_mds2_misu2_mds_MNT_misu2_client-000001007e011400: Connection restored to
service misu2_mds using nid [EMAIL PROTECTED]
....<snip>..
/var/log/messages on MDS:
Dec 14 16:31:26 mds2 kernel: LustreError:
0:0:(ldlm_lockd.c:205:waiting_locks_callback()) ### lock callback timer
expired: evicting client [EMAIL PROTECTED] nid [EMAIL PROTECTED] ns:
mds-misu2_mds_UUID lock: 0000010075f50700/0x959917d3f3b88931 lrc: 1/0,0 mode:
CR/CR res: 4780751/2726154257 bits 0x3 rrc: 7 type: IBT flags: 30 remote:
0x2a6036f763199bff expref: 7 pid 5373
Dec 14 16:31:26 mds2 kernel: LustreError:
0:0:(ldlm_lockd.c:205:waiting_locks_callback()) Skipped 1 previous similar
message
Dec 14 16:31:26 mds2 kernel: Lustre:
5376:0:(mds_reint.c:125:mds_finish_transno()) commit transaction for
disconnected client e689a9c4-2a46-9239-7fd7-5a7e2c8c6542: rc 0
Dec 14 16:31:26 mds2 kernel: LustreError:
5394:0:(handler.c:1478:mds_handle()) operation 101 on unconnected MDS from
[EMAIL PROTECTED]
Dec 14 16:31:26 mds2 kernel: LustreError:
5394:0:(handler.c:1478:mds_handle()) Skipped 1 previous similar message
Dec 14 16:31:26 mds2 kernel: LustreError:
5394:0:(ldlm_lib.c:1343:target_send_reply_msg()) @@@ processing error (-107)
[EMAIL PROTECTED] x852/t0 o101-><?>@<?>:-1 lens 392/0 ref 0 fl Interpret:/0/0
rc -107/0
Dec 14 16:31:26 mds2 kernel: LustreError:
5394:0:(ldlm_lib.c:1343:target_send_reply_msg()) Skipped 2 previous similar
messages
Dec 14 16:33:06 mds2 kernel: LustreError:
0:0:(ldlm_lockd.c:205:waiting_locks_callback()) ### lock callback timer
expired: evicting client [EMAIL PROTECTED] nid [EMAIL PROTECTED] ns:
mds-misu2_mds_UUID lock: 00000100b829fb00/0x959917d3f3b89ff1 lrc: 1/0,0 mode:
CR/CR res: 4780751/2726154257 bits 0x3 rrc: 8 type: IBT flags: 30 remote:
0x2a6036f763199c3e expref: 6 pid 5398
Dec 14 16:33:06 mds2 kernel: Lustre:
5373:0:(mds_reint.c:125:mds_finish_transno()) commit transaction for
disconnected client e689a9c4-2a46-9239-7fd7-5a7e2c8c6542: rc 0
Dec 14 16:33:06 mds2 kernel: LustreError:
5372:0:(handler.c:1478:mds_handle()) operation 101 on unconnected MDS from
[EMAIL PROTECTED]
Dec 14 16:33:06 mds2 kernel: LustreError:
5372:0:(ldlm_lib.c:1343:target_send_reply_msg()) @@@ processing error (-107)
[EMAIL PROTECTED] x927/t0 o101-><?>@<?>:-1 lens 392/0 ref 0 fl Interpret:/0/0
rc -107/0
....<snip>...
Thanks for any help/input,
/Per Lundqvist
--
Per Lundqvist
National Supercomputer Centre
Linköping University, Sweden
http://www.nsc.liu.se
_______________________________________________
Lustre-discuss mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss