On Oct 23, 2006, at 8:42 AM, Hoang-Nam Nguyen wrote:

> Hello Troy!
>> The netpipe code is available with mercurial by:
>> hg clone http://source.scl.ameslab.gov/hg/netpipe3-pvfs-dev
>> Once you have pvfs2-1.5.1 installed, you should be able to do 'make
>> pvfs' in the netpipe3-pvfs-dev directory and build NPpvfs.
>> The command line arguments I used to reproduce this were:
>> ./NPpvfs -d $PVFS_FILE_PATH -l 32768 -u 268435456 -n 100 -o $NETPIPE_OUTPUT_FILE
> Did you compile pvfs and NPpvfs as 32-bit or 64-bit libs/execs?
> I did compile pvfs and NPpvfs as is and realized that pvfs is built
> by default as 32-bit and NPpvfs as 64-bit. Hence NPpvfs complained
> to find incompatible pvfs libs.
> Regards
> Nam
>
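One way to confirm whether a binary or shared library was built 32-bit or 64-bit is to read its ELF header: byte 4 of the file (EI_CLASS) is 1 for 32-bit objects and 2 for 64-bit objects. The following is only a minimal sketch of that check; the function name elf_class is made up for illustration and is not part of NetPIPE or PVFS.

```python
ELF_MAGIC = b"\x7fELF"
# EI_CLASS values defined by the ELF format: 1 = ELFCLASS32, 2 = ELFCLASS64
EI_CLASS_NAMES = {1: "32-bit", 2: "64-bit"}


def elf_class(path):
    """Return '32-bit' or '64-bit' for an ELF file, or None otherwise.

    Hypothetical helper for illustration; it only inspects the first
    five bytes of the file (the magic number plus the EI_CLASS byte).
    """
    with open(path, "rb") as f:
        ident = f.read(5)
    if len(ident) < 5 or ident[:4] != ELF_MAGIC:
        return None  # not an ELF object (e.g. a script or an ar archive)
    return EI_CLASS_NAMES.get(ident[4])
```

Calling elf_class("./NPpvfs") and comparing the result with the same call on the pvfs shared library would show a mismatch like the one described above. A static library (.a) is an ar archive rather than an ELF file, so it returns None here; one of its member objects would have to be extracted first.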
I wasn't able to get reliable backtraces out of a 64-bit NPpvfs and pvfs libs, so I rebuilt as 32-bit, and now I get much more interesting errors and kernel logs.

If I start 4 netpipe processes on the same node with:

./NPpvfs -l 32768 -u 268435456 -n 100 -o results/proc2.w.out -I -d /pvfs2/6node/proc2

I get errors like:

27: 786429 bytes 100 times --> 2249.96 Mbps in 2666.70 usec
28: 786432 bytes 100 times --> [E 18:47:20.394586] Error: ib_check_cq: entry id 0x100ac7f0 opcode RDMA WRITE error IBV_WC_LOC_PROT_ERR.
[E 18:47:20.395051] [bt] ./NPpvfs(error+0x9c) [0x1005858c]
[E 18:47:20.395087] [bt] ./NPpvfs [0x10056a00]
[E 18:47:20.395118] [bt] ./NPpvfs [0x1005726c]

And kernel logs like this:

Oct 23 18:48:37 p5l8 kernel: PU0007 00060066:print_error_data HCAD_ERROR QP 0xdfe (resource=2000000000000dfe) has errors.
Oct 23 18:48:37 p5l8 kernel: PU0007 00060077:print_error_data HCAD_ERROR Error data is available: 2000000000000dfe.
Oct 23 18:48:37 p5l8 kernel: PU0007 00060079:print_error_data HCAD_ERROR EHCA ----- error data begin ---------------------------------------------------
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f000 ofs=0000 00000000000004d0 2000000000000dfe
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f010 ofs=0010 0100000000000310 8000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f020 ofs=0020 a000000500000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f030 ofs=0030 0000000001000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f040 ofs=0040 0000000000000001 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f050 ofs=0050 0000000000000014 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f060 ofs=0060 0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f070 ofs=0070 000000000000ffff 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f080 ofs=0080 008000000000262b 0000000000ffffff
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f090 ofs=0090 0000000000ffffff 0000000009f49900
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f0a0 ofs=00a0 00000000000e0492 000000000000000a
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f0b0 ofs=00b0 0000000000000001 000000000000002b
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f0c0 ofs=00c0 0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f0d0 ofs=00d0 0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f0e0 ofs=00e0 0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f0f0 ofs=00f0 0000000000000000 0000000000000003
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f100 ofs=0100 000000000000001a 0000000000000004
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f110 ofs=0110 0000000000000004 0000000000000032
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f120 ofs=0120 00000000dc9d4600 0000000003c32f28
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f130 ofs=0130 000000000009f4aa 000000000009f4aa
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f140 ofs=0140 0a00000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f150 ofs=0150 0000000000000002 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f160 ofs=0160 0000000000002633 000000000000262c
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f170 ofs=0170 0000000000000001 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f180 ofs=0180 0000000000000006 0000000000000004
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f190 ofs=0190 0000000000000004 00000001da05023d
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f1a0 ofs=01a0 000000000000001f 000000000000262b
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f1b0 ofs=01b0 00000000dc9d4600 0000000003c32f28
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f1c0 ofs=01c0 0000000000000001 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f1d0 ofs=01d0 00000000dc9e5600 0000000003c33328
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f1e0 ofs=01e0 0000000000000006 0000000000000001
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f1f0 ofs=01f0 0000000000000003 000000000000262c
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f200 ofs=0200 000000000009f499 0000000000000004
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f210 ofs=0210 0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f220 ofs=0220 0000000000000000 0000003000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f230 ofs=0230 0000000000000002 000000000000262b
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f240 ofs=0240 0000000000000000 000000000009f4a9
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f250 ofs=0250 00000000e3e9f820 0000000000000106
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f260 ofs=0260 0000000000000106 0000000000000003
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f270 ofs=0270 0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f280 ofs=0280 008000000000262b 0000000000ffffff
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f290 ofs=0290 0000000000ffffff 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f2a0 ofs=02a0 000000000000262c 8000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f2b0 ofs=02b0 09f22a0000000000 3808000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f2c0 ofs=02c0 0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f2d0 ofs=02d0 0000000000000000 2000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f2e0 ofs=02e0 8000000000000000 3808000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f2f0 ofs=02f0 0000000000000000 6800000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f300 ofs=0300 a800000000000000 0000003000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f310 ofs=0310 4000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f320 ofs=0320 0000000000000000 02000000000000c8
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f330 ofs=0330 0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f340 ofs=0340 0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f350 ofs=0350 0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f360 ofs=0360 0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f370 ofs=0370 0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f380 ofs=0380 0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f390 ofs=0390 0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f3a0 ofs=03a0 0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f3b0 ofs=03b0 0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f3c0 ofs=03c0 0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f3d0 ofs=03d0 0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f3e0 ofs=03e0 0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f3f0 ofs=03f0 0000000000000000 0400000000000060
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f400 ofs=0400 8000000000000000 c000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f410 ofs=0410 0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f420 ofs=0420 0000000003c2a383 00000000d7bc8280
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f430 ofs=0430 000000000000043f 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f440 ofs=0440 0000000000000000 0003000000000004
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f450 ofs=0450 0000000000000004 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f460 ofs=0460 0300000000000068 8040000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f470 ofs=0470 c000c00000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f480 ofs=0480 0000000000000000 0000000003c4ae81
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f490 ofs=0490 00000000fbe4f960 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f4a0 ofs=04a0 0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f4b0 ofs=04b0 0000000000000000 0000000000000004
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data resource=2000000000000dfe adr=c00000012ec3f4c0 ofs=04c0 0000000000000004 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007c:print_error_data HCAD_ERROR EHCA ----- error data end ----------------------------------------------------
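For anyone wanting to inspect the error data above, it may help to collapse the print_error_data records into one offset-indexed table and look only at the non-zero words. The following is a rough sketch only: the regular expression is inferred from the log lines in this message rather than from the ehca driver source, it assumes each record carries two 64-bit words at ofs and ofs+8, and the names parse_error_data and nonzero_words are made up for illustration.

```python
import re

# Pattern inferred from the log lines in this message (an assumption,
# not taken from the ehca driver): "ofs=XXXX <16 hex digits> <16 hex digits>"
RECORD = re.compile(r"ofs=([0-9a-f]{4})\s+([0-9a-f]{16})\s+([0-9a-f]{16})")


def parse_error_data(text):
    """Map byte offset -> 64-bit word for every record found in text.

    Matching runs over the whole text, so it works whether the log is
    one record per line or has been re-wrapped by a mail client.
    """
    words = {}
    for m in RECORD.finditer(text):
        ofs = int(m.group(1), 16)
        words[ofs] = int(m.group(2), 16)      # first word of the record
        words[ofs + 8] = int(m.group(3), 16)  # second word, 8 bytes later
    return words


def nonzero_words(words):
    """Return (offset, value) pairs for non-zero words, sorted by offset."""
    return sorted((ofs, val) for ofs, val in words.items() if val != 0)
```

Feeding the saved kernel log text to parse_error_data and printing nonzero_words(...) in hex gives a much shorter listing to compare between a failing and a working run. What the individual words mean is specific to the adapter firmware and is not interpreted here.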
