According to the log, you're getting IBV_WC_WR_FLUSH_ERR returned by the ib_check_cq function, which does all the polling for openIB.
The IB spec says this about the error:
"Work Request Flushed Error - A Work Request was in process or outstanding when the QP transitioned into the Error State."

It doesn't go any further into the details of this error, but generally, whenever the QP is sent into an error state, most of the IB community considers it a fatal error. (Correct me if I'm wrong, please.) This leads me to believe that you may still have underlying network problems. Have you been able to successfully run the various openIB test programs like ibv_rc_pingpong, or tried the latest NetPIPE release, which has openIB support? (It may not give a pretty answer other than crashing if you have network problems, though :-/ )
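
To illustrate why the flush error alone does not tell us much: as I understand it, the flush status is a consequence of the QP having entered the Error State, so the thing to look for is a completion with a more specific status, if one shows up on this side at all. Below is a minimal sketch of that triage. The enum and both function names are hypothetical stand-ins of mine so that it compiles without libibverbs; they are not the real enum ibv_wc_status values and not pvfs code. Real code would read the status field of each struct ibv_wc filled in by ibv_poll_cq().

```c
#include <stddef.h>

/* Stand-in status codes (assumption: illustration only). Real code would
 * use enum ibv_wc_status from <infiniband/verbs.h> instead. */
enum sketch_wc_status {
    SKETCH_WC_SUCCESS,
    SKETCH_WC_WR_FLUSH_ERR,   /* stands in for IBV_WC_WR_FLUSH_ERR */
    SKETCH_WC_RETRY_EXC_ERR,  /* stands in for IBV_WC_RETRY_EXC_ERR */
    SKETCH_WC_OTHER_ERR       /* any other non-flush error */
};

/* Return the index of the first completion that is neither a success nor
 * a flush (the likely root cause), or -1 if the batch has none. */
int first_root_cause(const enum sketch_wc_status *wc, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (wc[i] != SKETCH_WC_SUCCESS && wc[i] != SKETCH_WC_WR_FLUSH_ERR)
            return (int)i;
    }
    return -1;
}

/* Count the completions that were merely flushed. */
size_t count_flushed(const enum sketch_wc_status *wc, size_t n)
{
    size_t flushed = 0;
    for (size_t i = 0; i < n; i++) {
        if (wc[i] == SKETCH_WC_WR_FLUSH_ERR)
            flushed++;
    }
    return flushed;
}
```

If first_root_cause() returns -1 while count_flushed() is nonzero, as with the RECV entries in this log, the local completions only say that the QP failed, not why. The cause may then have to be found on the peer or in the fabric.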

If the network ends up not being the problem, we've got a serious problem here in the code, as we should never be putting the QP into erroneous states.

Also, Pete, the spec doesn't say anything about async errors being flagged for an error like this. Is this a case where we might be able to get useful information about the QP, via async events, before or as it goes into an error state?

Kyle

Kyle Schochenmaier wrote:
This is actually an error propagating up from openIB, not pvfs. I've never seen the error before, and I'm not sure if it is a fatal error or something that we can handle inside pvfs. I'll have to look at the IB spec and see if we can generate a patch for this.

[E 15:44:43.719270] Error: ib_check_cq: entry id 0x5c4e70 opcode RECV error IBV_WC_WR_FLUSH_ERR.

Kyle


Tad Kollar wrote:
Pete Wyckoff wrote:
Have you been able to use, say, pvfs2-cp to put files into PVFS over
IB?  That will help us know if it's a kernel problem or an IB
problem, perhaps.
After getting your reply I set up a test that used pvfs2-cp to copy a
2.5G file back and forth a total of 30 times. During that process,
pvfs2-cp generated these three errors, always during the read back from
the pvfs2 fs:

[E 15:44:43.719270] Error: ib_check_cq: entry id 0x5c4e70 opcode RECV error IBV_WC_WR_FLUSH_ERR.
[E 15:44:43.924115]     [bt] pvfs2-cp(error+0xca) [0x44a1ca]
[E 15:44:43.924161]     [bt] pvfs2-cp [0x448dc3]
[E 15:44:43.924171]     [bt] pvfs2-cp [0x4492c6]
[E 15:44:43.924179]     [bt] pvfs2-cp(BMI_testcontext+0x151) [0x433371]
[E 15:44:43.924187]     [bt] pvfs2-cp(PINT_thread_mgr_bmi_push+0x144) [0x43c054]
[E 15:44:43.924195]     [bt] pvfs2-cp(job_testcontext+0x15a) [0x43b87a]
[E 15:44:43.924204]     [bt] pvfs2-cp(PINT_client_state_machine_test+0x98) [0x40ff88]
[E 15:44:43.924211]     [bt] pvfs2-cp(PVFS_sys_wait+0x63) [0x4103b3]
[E 15:44:43.924220]     [bt] pvfs2-cp(PVFS_sys_io+0x6b) [0x41635b]
[E 15:44:43.924228]     [bt] pvfs2-cp(main+0x372) [0x40d792]
[E 15:44:43.924236]     [bt] /lib/libc.so.6(__libc_start_main+0xda) [0x2aaaab0784ca]

[E 09:06:20.511281] Error: ib_check_cq: entry id 0x5e83f0 opcode RECV error IBV_WC_WR_FLUSH_ERR.
[E 09:06:21.104063]     [bt] pvfs2-cp(error+0xca) [0x44a1ca]
[E 09:06:21.104112]     [bt] pvfs2-cp [0x448dc3]
[E 09:06:21.104120]     [bt] pvfs2-cp [0x4492c6]
[E 09:06:21.104128]     [bt] pvfs2-cp(BMI_testcontext+0x151) [0x433371]
[E 09:06:21.104136]     [bt] pvfs2-cp(PINT_thread_mgr_bmi_push+0x144) [0x43c054]
[E 09:06:21.104143]     [bt] pvfs2-cp(job_testcontext+0x15a) [0x43b87a]
[E 09:06:21.104151]     [bt] pvfs2-cp(PINT_client_state_machine_test+0x98) [0x40ff88]
[E 09:06:21.104158]     [bt] pvfs2-cp(PVFS_sys_wait+0x63) [0x4103b3]
[E 09:06:21.104165]     [bt] pvfs2-cp(PVFS_sys_io+0x6b) [0x41635b]
[E 09:06:21.104173]     [bt] pvfs2-cp(main+0x372) [0x40d792]
[E 09:06:21.104181]     [bt] /lib/libc.so.6(__libc_start_main+0xda) [0x2aaaab0784ca]

[E 09:09:46.596001] Error: ib_check_cq: entry id 0x5c4cc0 opcode RECV error IBV_WC_WR_FLUSH_ERR.
[E 09:09:47.109736]     [bt] pvfs2-cp(error+0xca) [0x44a1ca]
[E 09:09:47.109790]     [bt] pvfs2-cp [0x448dc3]
[E 09:09:47.109799]     [bt] pvfs2-cp [0x4492c6]
[E 09:09:47.109807]     [bt] pvfs2-cp(BMI_testcontext+0x151) [0x433371]
[E 09:09:47.109816]     [bt] pvfs2-cp(PINT_thread_mgr_bmi_push+0x144) [0x43c054]
[E 09:09:47.109823]     [bt] pvfs2-cp(job_testcontext+0x15a) [0x43b87a]
[E 09:09:47.109831]     [bt] pvfs2-cp(PINT_client_state_machine_test+0x98) [0x40ff88]
[E 09:09:47.109840]     [bt] pvfs2-cp(PVFS_sys_wait+0x63) [0x4103b3]
[E 09:09:47.109847]     [bt] pvfs2-cp(PVFS_sys_io+0x6b) [0x41635b]
[E 09:09:47.109856]     [bt] pvfs2-cp(main+0x372) [0x40d792]
[E 09:09:47.109863]     [bt] /lib/libc.so.6(__libc_start_main+0xda) [0x2aaaab0784ca]
The other interesting thing to know is if you can reconfigure PVFS to
use only TCP, then run your bonnie test and get the same error.
Except for IB testing, I've had TCP specified in the pvfs2tab and mount
options and haven't been able to disrupt it; is that sufficient or
should I remove all references to IB? I repeated the pvfs2-cp using TCP
and didn't receive any errors.

Tad
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users



