We recently upgraded all of our Lustre clients from 1.4.11.1 to 1.6.3 
while keeping all the MDS/OSS servers on 1.4.11.1 (they will be upgraded 
later). This has resulted in a previously unseen error when running a 
specific application (ccsm3) on this Lustre filesystem: opening the same 
file from a handful of compute nodes causes some or all of the 
corresponding Lustre clients to be evicted and the open() system call to 
fail.

When triggered, any process on any node that tries to stat the directory 
where this file is opened also gets its client evicted and freezes for 
100 s. (Output from /var/log/messages is included at the end of this 
mail; apologies for the length.)

I am interested in knowing whether this problem is triggered only when 
using 1.6.3 clients against 1.4.11.1 servers, or whether it can also 
occur with closer client/server version combinations. Alternatively, is 
there any way to prevent it without upgrading the servers or downgrading 
the clients? (I will later try to reproduce this using 1.4.11.1 clients.)

Our setup:
   1 MDS + 3 OSS: Centos4 with 2.6.9-55.0.9.EL_lustre.1.4.11.1smp
   clients:       Centos4 with 2.6.9-55.0.9.EL_lustre.1.6.3smp
   (Lustre over Gigabit ethernet, no failover)

We have narrowed this down to two similar examples, both of which trigger 
the error:

one in Fortran:
      implicit none
      integer i

      do i=1,1000
         open(unit=16,file='tmpfile')
         close(16)
      enddo
      stop
      end

and one in C:
   #include <fcntl.h>
   #include <unistd.h>
   #include <stdio.h>
   #include <stdlib.h>
   
   int main(void) {
     int i, fd;
     char cwd[4096];
     char tmpfile[4096 + 16];   /* room for cwd plus "/tmpfile" */
   
     for (i = 0; i < 1000; i++) {
       if (getcwd(cwd, sizeof(cwd)) == NULL)
         exit(1);
       snprintf(tmpfile, sizeof(tmpfile), "%s/tmpfile", cwd);
       fd = open(tmpfile, O_RDWR|O_CREAT, 0666);
       if (fd < 0) {
         perror("open");
         exit(1);
       }
       close(fd);
     }
     return 0;
   }

We can trigger the error by executing one of the examples on at least 3 
nodes in parallel, i.e.:
   1. copy the binary to its own directory on the lustre filesystem.
   2. execute the binary in parallel on at least 3 nodes.

Example:
   gcc evict.c -o evict
   mkdir /lustrefs/tmpdir
   cp evict /lustrefs/tmpdir
   pdsh -f3 -w n[1-3] 'cd /lustrefs/tmpdir; ./evict'

/var/log/messages on evicted client:
   Dec 14 16:31:26 n75 kernel: LustreError: 11-0: an error occurred while 
communicating with [EMAIL PROTECTED] The ldlm_enqueue operation failed with -107
   Dec 14 16:31:26 n75 kernel: Lustre: 
MDC_mds2_misu2_mds_MNT_misu2_client-000001007e011400: Connection to service 
misu2_mds via nid [EMAIL PROTECTED] was lost; in progress operations using this service will wait for recovery to complete.
   Dec 14 16:31:26 n75 kernel: LustreError: 167-0: This client was evicted by 
misu2_mds; in progress operations using this service will fail.
   Dec 14 16:31:26 n75 kernel: LustreError: 
5484:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -5
   Dec 14 16:31:26 n75 kernel: LustreError: 
5484:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID  [EMAIL PROTECTED] x854/t0 o101->[EMAIL PROTECTED]:12 lens 392/728 ref 1 fl Rpc:/0/0 rc 0/0
   Dec 14 16:31:26 n75 kernel: Lustre: 
MDC_mds2_misu2_mds_MNT_misu2_client-000001007e011400: Connection restored to 
service misu2_mds using nid [EMAIL PROTECTED]
   Dec 14 16:33:06 n75 kernel: LustreError: 11-0: an error occurred while 
communicating with [EMAIL PROTECTED] The ldlm_enqueue operation failed with -107
   Dec 14 16:33:06 n75 kernel: Lustre: 
MDC_mds2_misu2_mds_MNT_misu2_client-000001007e011400: Connection to service 
misu2_mds via nid [EMAIL PROTECTED] was lost; in progress operations using this service will wait for recovery to complete.
   Dec 14 16:33:06 n75 kernel: LustreError: 167-0: This client was evicted by 
misu2_mds; in progress operations using this service will fail.
   Dec 14 16:33:06 n75 kernel: LustreError: 
5484:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -5
   Dec 14 16:33:06 n75 kernel: LustreError: 
5484:0:(mdc_locks.c:423:mdc_finish_enqueue()) Skipped 1 previous similar message
   Dec 14 16:33:06 n75 kernel: Lustre: 
MDC_mds2_misu2_mds_MNT_misu2_client-000001007e011400: Connection restored to 
service misu2_mds using nid [EMAIL PROTECTED]
   ....<snip>..
   
/var/log/messages on MDS:
   Dec 14 16:31:26 mds2 kernel: LustreError: 
0:0:(ldlm_lockd.c:205:waiting_locks_callback()) ### lock callback timer 
expired: evicting client [EMAIL PROTECTED] nid [EMAIL PROTECTED]  ns: 
mds-misu2_mds_UUID lock: 0000010075f50700/0x959917d3f3b88931 lrc: 1/0,0 mode: 
CR/CR res: 4780751/2726154257 bits 0x3 rrc: 7 type: IBT flags: 30 remote: 
0x2a6036f763199bff expref: 7 pid 5373
   Dec 14 16:31:26 mds2 kernel: LustreError: 
0:0:(ldlm_lockd.c:205:waiting_locks_callback()) Skipped 1 previous similar 
message
   Dec 14 16:31:26 mds2 kernel: Lustre: 
5376:0:(mds_reint.c:125:mds_finish_transno()) commit transaction for 
disconnected client e689a9c4-2a46-9239-7fd7-5a7e2c8c6542: rc 0
   Dec 14 16:31:26 mds2 kernel: LustreError: 
5394:0:(handler.c:1478:mds_handle()) operation 101 on unconnected MDS from 
[EMAIL PROTECTED]
   Dec 14 16:31:26 mds2 kernel: LustreError: 
5394:0:(handler.c:1478:mds_handle()) Skipped 1 previous similar message
   Dec 14 16:31:26 mds2 kernel: LustreError: 
5394:0:(ldlm_lib.c:1343:target_send_reply_msg()) @@@ processing error (-107)  
[EMAIL PROTECTED] x852/t0 o101-><?>@<?>:-1 lens 392/0 ref 0 fl Interpret:/0/0 
rc -107/0
   Dec 14 16:31:26 mds2 kernel: LustreError: 
5394:0:(ldlm_lib.c:1343:target_send_reply_msg()) Skipped 2 previous similar 
messages
   Dec 14 16:33:06 mds2 kernel: LustreError: 
0:0:(ldlm_lockd.c:205:waiting_locks_callback()) ### lock callback timer 
expired: evicting client [EMAIL PROTECTED] nid [EMAIL PROTECTED]  ns: 
mds-misu2_mds_UUID lock: 00000100b829fb00/0x959917d3f3b89ff1 lrc: 1/0,0 mode: 
CR/CR res: 4780751/2726154257 bits 0x3 rrc: 8 type: IBT flags: 30 remote: 
0x2a6036f763199c3e expref: 6 pid 5398
   Dec 14 16:33:06 mds2 kernel: Lustre: 
5373:0:(mds_reint.c:125:mds_finish_transno()) commit transaction for 
disconnected client e689a9c4-2a46-9239-7fd7-5a7e2c8c6542: rc 0
   Dec 14 16:33:06 mds2 kernel: LustreError: 
5372:0:(handler.c:1478:mds_handle()) operation 101 on unconnected MDS from 
[EMAIL PROTECTED]
   Dec 14 16:33:06 mds2 kernel: LustreError: 
5372:0:(ldlm_lib.c:1343:target_send_reply_msg()) @@@ processing error (-107)  
[EMAIL PROTECTED] x927/t0 o101-><?>@<?>:-1 lens 392/0 ref 0 fl Interpret:/0/0 
rc -107/0
   ....<snip>...

Thanks for any help/input,
   
/Per Lundqvist

-- 
Per Lundqvist

National Supercomputer Centre
Linköping University, Sweden

http://www.nsc.liu.se
_______________________________________________
Lustre-discuss mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss