Tziporet Koren wrote:
> BOYRIE Fabrice wrote:
>
>> Hello
>>
>> I hope this is the right mailing list.
>> I have a problem with OFED 1.4.2 on CentOS 5.3.
>>

Hi Fabrice! Does it also happen with OFED 1.5 alpha? Thanks. -jeff

>>
>> We have a new cluster with QDR InfiniBand.
>> I've installed OFED from source using the install.pl script with the
>> default values.
>> I've used the default kernel from CentOS (2.6.18-128.7.1.el5).
>> When a node starts, the openibd and opensmd services are launched.
>>
>>
>> InfiniBand is working:
>>
>> ibv_devinfo
>> hca_id: mlx4_0
>>     fw_ver: 2.6.000
>>     node_guid: 0002:c903:0004:3efc
>>     sys_image_guid: 0002:c903:0004:3eff
>>     vendor_id: 0x02c9
>>     vendor_part_id: 26428
>>     hw_ver: 0xA0
>>     board_id: MT_0C40110009
>>     phys_port_cnt: 1
>>         port: 1
>>             state: PORT_ACTIVE (4)
>>             max_mtu: 2048 (4)
>>             active_mtu: 2048 (4)
>>             sm_lid: 9
>>             port_lid: 17
>>             port_lmc: 0x00
>>
>> If I launch an MPI program, e.g. VASP, it works over the InfiniBand
>> transport and the performance is good.
>>
>> So there is no problem until I want to launch a program that does not
>> use InfiniBand: Gaussian.
>>
>> With some large calculations and with %ncpu=8, Gaussian aborts with the
>> following message:
>> ntrbks: Input/output error
>> I launched Gaussian several times and it always aborted at the same
>> point.
>>
>> If I launch the same Gaussian on the same input file on our old
>> cluster (same CentOS 5.3, same kernel, but without InfiniBand), it works.
>>
>> Searching the source code for ntrbks shows a call to fstatfs.
>>
>> So I've straced Gaussian on the two clusters. Here is the relevant
>> part.
>>
>> New cluster:
>>
>> [pid 5715] execve("/opt/Gaussian/g03_e01-pgf//g03/l1002.exe",
>> ["/opt/Gaussian/g03_e01-pgf//g03/l"..., "1258291200",
>> "CpRh_H_Ph_EneTS1.chk", "1", "/tmp/CpRh_H_Ph_EneTS1/Gau-5715.i"..., "0",
>> "/tmp/CpRh_H_Ph_EneTS1/Gau-5715.r"..., "0",
>> "/tmp/CpRh_H_Ph_EneTS1/Gau-5715.d"..., "0",
>> "/tmp/CpRh_H_Ph_EneTS1/Gau-5715.s"..., "0",
>> "/tmp/CpRh_H_Ph_EneTS1/Gau-5714.i"..., "0", "junk.out", "0", ...],
>> [/* 65 vars */] PANIC: attached pid 5816 exited with 0
>> [pid 5715] open("CpRh_H_Ph_EneTS1.chk", O_RDWR) = 5
>> [pid 5715] fstatfs(5, {f_type="NFS_SUPER_MAGIC", f_bsize=32768,
>> f_blocks=13562292, f_bfree=12353558, f_bavail=12353558,
>> f_files=434124416, f_ffree=434090957, f_fsid={0, 0}, f_namelen=255,
>> f_frsize=32768}) = 0
>> [pid 5715] read(5, "\10\0\0\0\0\0\0\0", 8) = 8
>> [pid 5715] read(5,
>> "\10\0\0\0\0\0\0\0\0\320\10\0\0\0\0\0\0\240\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
>> 320032) = 320032
>> [pid 5715] fstatfs(5, 0x7fff553be6d0) = -1 EIO (Input/output error)
>>
>>
>> Old cluster:
>>
>> [pid 8605] execve("/opt/Gaussian/g03_e01-pgf//g03/l1002.exe",
>> ["/opt/Gaussian/g03_e01-pgf//g03/l"..., "1258291200",
>> "CpRh_H_Ph_EneTS1.chk", "1", "/tmp/CpRh_H_Ph_EneTS1-8-pgf-2/Ga"..., "0",
>> "/tmp/CpRh_H_Ph_EneTS1-8-pgf-2/Ga"..., "0",
>> "/tmp/CpRh_H_Ph_EneTS1-8-pgf-2/Ga"..., "0",
>> "/tmp/CpRh_H_Ph_EneTS1-8-pgf-2/Ga"..., "0",
>> "/tmp/CpRh_H_Ph_EneTS1-8-pgf-2/Ga"..., "0", "junk.out", "0", ...],
>> [/* 82 vars */]PANIC: attached pid 8701 exited with 0
>> [pid 8605] open("CpRh_H_Ph_EneTS1.chk", O_RDWR) = 5
>>
>> [pid 8605] fstatfs(5, {f_type="NFS_SUPER_MAGIC", f_bsize=32768,
>> f_blocks=9150944, f_bfree=686850, f_bavail=686850, f_files=88123232,
>> f_ffree=87917195, f_fsid={0, 0}, f_namelen=255, f_frsize=32768}) = 0
>> [pid 8605] read(5, "'\0\0\0\0\0\0\0", 8) = 8
>> [pid 8605] read(5,
>> "'\0\0\0\0\0\0\0\0\360$\0\0\0\0\0\0\240\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
>> 320032) = 320032
>> [pid 8605] fstatfs(5, {f_type="NFS_SUPER_MAGIC", f_bsize=32768,
>> f_blocks=9150944, f_bfree=683022, f_bavail=683022, f_files=87633232,
>> f_ffree=87427156, f_fsid={0, 0}, f_namelen=255, f_frsize=32768}) = 0
>> [pid 8605] fstatfs(5, {f_type="NFS_SUPER_MAGIC", f_bsize=32768,
>> f_blocks=9150944, f_bfree=683022, f_bavail=683022, f_files=87633232,
>> f_ffree=87427156, f_fsid={0, 0}, f_namelen=255, f_frsize=32768}) = 0
>> [pid 8605] write(5,
>> "'\0\0\0\0\0\0\0\0\360$\0\0\0\0\0\0\240\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
>> 320032) = 320032
>> [pid 8605] close(5) = 0
>>
>>
>> If I put the input file in a local directory instead of an NFS one,
>> Gaussian works.
>> There are no messages in dmesg or in the /var/log directory on the node
>> or on the NFS server.
>>
>> On the node, /home is mounted as
>> 192.168.1.100:/home on /home type nfs
>> (rw,nosuid,rsize=32768,proto=tcp,addr=192.168.1.100)
>>
>> 192.168.1.xxx is the Ethernet network (the NFS server has no
>> InfiniBand card).
>> On the node, it is enough to do
>> «ifconfig ib0 down
>> service opensmd stop
>> » to have Gaussian working on the NFS directory.
>>
>> («ifconfig ib0 down» or «service opensmd stop» alone is not enough)
>>
>>
>> So there seems to be an interaction between NFS access and OpenFabrics.
>> But why? And how can it be solved?
>>
>>
> This looks like an issue with the NFS/RDMA backports.
> Can you install OFED without NFS/RDMA?
> You can change the conf file for this.
>
> Jon/Steve/Jeff - are you familiar with this issue?
>
> Tziporet
> _______________________________________________
> general mailing list
> general@lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
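
The failing sequence in the strace output above (open, fstatfs, read, fstatfs) could be replayed outside Gaussian to see whether the second fstatfs fails on its own. What follows is only a minimal sketch in Python, not anything taken from Gaussian or OFED. It assumes that os.fstatvfs ends up in the same fstatfs system call on Linux (glibc builds fstatvfs on top of it, but this is worth confirming with strace). The name probe_fstatfs is hypothetical; the 8-byte and 320032-byte read sizes are copied from the trace.

```python
import os


def probe_fstatfs(path, nbytes=320032):
    """Replay open -> fstatfs -> read -> fstatfs on `path`.

    Returns 0 if every call succeeds, otherwise the errno of the
    first call that failed (for example errno.EIO).
    """
    try:
        fd = os.open(path, os.O_RDWR)
    except OSError as exc:
        return exc.errno
    try:
        os.fstatvfs(fd)      # first filesystem stat, before any read
        os.read(fd, 8)       # short header read, as in the trace
        os.read(fd, nbytes)  # large read, as in the trace
        os.fstatvfs(fd)      # second stat: the call that returned EIO
    except OSError as exc:
        return exc.errno
    finally:
        os.close(fd)
    return 0
```

Calling probe_fstatfs("CpRh_H_Ph_EneTS1.chk") on the NFS mount, once with ib0 up and opensmd running and once after the two commands quoted above, would show whether the error can be reproduced without Gaussian.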