According to the log, you're getting IBV_WC_WR_FLUSH_ERR returned by the
check_cq function, which does all the polling for openIB.
The IB spec says this about the error:
"Work Request Flushed Error - A Work Request was in process or
outstanding when the QP transitioned into the Error State."
It doesn't go any further into the details of this error, but generally,
whenever the QP is sent into an error state,
most of the IB community considers it a fatal error.
(Correct me if I'm wrong, please.)
This leads me to believe that you may still have underlying network
problems.
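
As far as I understand it, a flush is usually a secondary symptom: once
the QP is in the Error State, every outstanding WR completes with
IBV_WC_WR_FLUSH_ERR, so the completion worth looking for is the first
error on that QP that is not a flush, if one was reported locally at all.
Here is a minimal sketch of that triage. To keep it compilable without
libibverbs, the enum is a local stand-in with hypothetical names, not the
real enum ibv_wc_status, and first_root_cause() is a made-up helper, not
anything in pvfs.

```c
#include <stddef.h>

/* Local stand-ins for a few values of libibverbs' enum ibv_wc_status
 * (the real names carry an IBV_ prefix). Hypothetical, illustration only. */
enum wc_status {
    WC_SUCCESS,
    WC_RETRY_EXC_ERR,
    WC_RNR_RETRY_EXC_ERR,
    WC_WR_FLUSH_ERR
};

/* Return the index of the first completion that is neither a success nor
 * a flush, i.e. the likely root cause. Return -1 if there is none, which
 * means only flushes (or successes) were seen in this batch. */
int first_root_cause(const enum wc_status *wc, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (wc[i] != WC_SUCCESS && wc[i] != WC_WR_FLUSH_ERR)
            return (int)i;
    }
    return -1;
}
```

If this returns -1, as it would for a log that shows nothing but RECV
flushes, the real cause was probably reported somewhere else, for example
on the peer or as an async event.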
Have you been able to successfully run the various openIB test programs,
like ibv_rc_pingpong, or tried the latest NetPIPE release,
which has openIB support? (It may not give a pretty answer other than
crashing if you have network problems, though :-/ )
If the network ends up not being the problem, we've got a serious
problem here in the code, as we should never be putting the QP into
erroneous states.
Also, Pete, the spec doesn't say anything about async errors being
flagged for an error like this. Is this a case where we might be able to
get useful information about the QP before or as it goes into an error
state via async events?
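
For reference, libibverbs hands out async events through
ibv_get_async_event() and ibv_ack_async_event(), and the QP-affiliated
error events I know of are IBV_EVENT_QP_FATAL, IBV_EVENT_QP_REQ_ERR and
IBV_EVENT_QP_ACCESS_ERR. Whether one of them shows up before the flushed
completions in this case is exactly the open question. A rough sketch of
the check I have in mind follows. The enum is again a local stand-in with
hypothetical names so it builds without the library, and
event_means_qp_error() is a made-up helper.

```c
#include <stdbool.h>

/* Local stand-ins for some values of libibverbs' enum ibv_event_type
 * (the real names carry an IBV_EVENT_ prefix). Hypothetical names. */
enum async_event {
    EV_QP_FATAL,
    EV_QP_REQ_ERR,
    EV_QP_ACCESS_ERR,
    EV_COMM_EST,
    EV_SQ_DRAINED,
    EV_PORT_ERR
};

/* True for the QP-affiliated error events, after which outstanding work
 * requests on that QP can be expected to complete as flushed. */
bool event_means_qp_error(enum async_event ev)
{
    switch (ev) {
    case EV_QP_FATAL:
    case EV_QP_REQ_ERR:
    case EV_QP_ACCESS_ERR:
        return true;
    default:
        return false;
    }
}
```

A thread blocked on the async event queue could log the event and the
affected QP when this returns true, which might give us the cause before
the poll loop sees the flush.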
Kyle
Kyle Schochenmaier wrote:
This is actually an error propagating up from openIB, not pvfs. I've
never seen the error before, and I'm not sure if it is a fatal error
or something that we can handle inside pvfs. I'll have to look at the
IB spec and see if we can generate a patch for this.
[E 15:44:43.719270] Error: ib_check_cq: entry id 0x5c4e70 opcode RECV
error IBV_WC_WR_FLUSH_ERR.
Kyle
Tad Kollar wrote:
Pete Wyckoff wrote:
Have you been able to use, say, pvfs2-cp to put files into PVFS over
IB? That will help us know if it's a kernel problem or an IB
problem, perhaps.
After getting your reply I set up a test that used pvfs2-cp to copy a
2.5G file back and forth a total of 30 times. During that process,
pvfs2-cp generated these three errors, always during the read back from
the pvfs2 fs:
[E 15:44:43.719270] Error: ib_check_cq: entry id 0x5c4e70 opcode RECV
error IBV_WC_WR_FLUSH_ERR.
[E 15:44:43.924115] [bt] pvfs2-cp(error+0xca) [0x44a1ca]
[E 15:44:43.924161] [bt] pvfs2-cp [0x448dc3]
[E 15:44:43.924171] [bt] pvfs2-cp [0x4492c6]
[E 15:44:43.924179] [bt] pvfs2-cp(BMI_testcontext+0x151) [0x433371]
[E 15:44:43.924187] [bt] pvfs2-cp(PINT_thread_mgr_bmi_push+0x144)
[0x43c054]
[E 15:44:43.924195] [bt] pvfs2-cp(job_testcontext+0x15a) [0x43b87a]
[E 15:44:43.924204] [bt]
pvfs2-cp(PINT_client_state_machine_test+0x98) [0x40ff88]
[E 15:44:43.924211] [bt] pvfs2-cp(PVFS_sys_wait+0x63) [0x4103b3]
[E 15:44:43.924220] [bt] pvfs2-cp(PVFS_sys_io+0x6b) [0x41635b]
[E 15:44:43.924228] [bt] pvfs2-cp(main+0x372) [0x40d792]
[E 15:44:43.924236] [bt] /lib/libc.so.6(__libc_start_main+0xda)
[0x2aaaab0784ca]
[E 09:06:20.511281] Error: ib_check_cq: entry id 0x5e83f0 opcode RECV
error IBV_WC_WR_FLUSH_ERR.
[E 09:06:21.104063] [bt] pvfs2-cp(error+0xca) [0x44a1ca]
[E 09:06:21.104112] [bt] pvfs2-cp [0x448dc3]
[E 09:06:21.104120] [bt] pvfs2-cp [0x4492c6]
[E 09:06:21.104128] [bt] pvfs2-cp(BMI_testcontext+0x151) [0x433371]
[E 09:06:21.104136] [bt] pvfs2-cp(PINT_thread_mgr_bmi_push+0x144)
[0x43c054]
[E 09:06:21.104143] [bt] pvfs2-cp(job_testcontext+0x15a) [0x43b87a]
[E 09:06:21.104151] [bt]
pvfs2-cp(PINT_client_state_machine_test+0x98) [0x40ff88]
[E 09:06:21.104158] [bt] pvfs2-cp(PVFS_sys_wait+0x63) [0x4103b3]
[E 09:06:21.104165] [bt] pvfs2-cp(PVFS_sys_io+0x6b) [0x41635b]
[E 09:06:21.104173] [bt] pvfs2-cp(main+0x372) [0x40d792]
[E 09:06:21.104181] [bt] /lib/libc.so.6(__libc_start_main+0xda)
[0x2aaaab0784ca]
[E 09:09:46.596001] Error: ib_check_cq: entry id 0x5c4cc0 opcode RECV
error IBV_WC_WR_FLUSH_ERR.
[E 09:09:47.109736] [bt] pvfs2-cp(error+0xca) [0x44a1ca]
[E 09:09:47.109790] [bt] pvfs2-cp [0x448dc3]
[E 09:09:47.109799] [bt] pvfs2-cp [0x4492c6]
[E 09:09:47.109807] [bt] pvfs2-cp(BMI_testcontext+0x151) [0x433371]
[E 09:09:47.109816] [bt] pvfs2-cp(PINT_thread_mgr_bmi_push+0x144)
[0x43c054]
[E 09:09:47.109823] [bt] pvfs2-cp(job_testcontext+0x15a) [0x43b87a]
[E 09:09:47.109831] [bt]
pvfs2-cp(PINT_client_state_machine_test+0x98) [0x40ff88]
[E 09:09:47.109840] [bt] pvfs2-cp(PVFS_sys_wait+0x63) [0x4103b3]
[E 09:09:47.109847] [bt] pvfs2-cp(PVFS_sys_io+0x6b) [0x41635b]
[E 09:09:47.109856] [bt] pvfs2-cp(main+0x372) [0x40d792]
[E 09:09:47.109863] [bt] /lib/libc.so.6(__libc_start_main+0xda)
[0x2aaaab0784ca]
The other interesting thing to know is whether you get the same error
after reconfiguring PVFS to use only TCP and rerunning your bonnie test.
Except for IB testing, I've had TCP specified in the pvfs2tab and mount
options and haven't been able to disrupt it; is that sufficient or
should I remove all references to IB? I repeated the pvfs2-cp using TCP
and didn't receive any errors.
Tad
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users