Pete -
We've been experimenting with various FlowBufferSizes on the servers, as well as varying the stripe_size when opening/modifying files for our benchmarking. I ran across an error that causes our tests to hang. I'm not sure where this report should go, but it's reproducible with our setup. Could anyone tell me if there's an obvious problem with this configuration?
(using a Mellanox DDR card)
FlowBufferSize 16MB
stripe_size 256KB
6 data servers
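For what it's worth, here is roughly what that stripe size maps to in fs.conf; this is a minimal sketch, assuming the standard simple_stripe <Distribution> stanza, and the exact directive names may vary by PVFS2 version, so treat it as an illustration rather than our literal config:

    <Distribution>
        # simple_stripe round-robins fixed-size strips across the I/O servers
        Name simple_stripe
        Param strip_size
        Value 262144    # 256KB expressed in bytes
    </Distribution>

The other stripe sizes we tried (64KB, 512KB, 1MB) would just change the Value line.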
The problem occurs whenever we pick 256KB as the stripe size; however, it doesn't show up at 64KB, or at 1MB and above (we're testing 512KB right now to see whether it occurs there). We've also noticed that 256KB stripes generally cause odd behavior, such as eHCA errors that bring the server down completely...
The following shows up in the logs of all servers:
[E 10:09:51.709387] Warning: openib_check_async_events: IBV_EVENT_QP_ACCESS_ERR.
[E 10:10:22.026509] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 2028006.
[E 10:10:22.026567] fp_multiqueue_cancel: flow proto cancel called on 0x2aaaabc21d20
[E 10:10:22.026578] handle_io_error: flow proto error cleanup started on 0x2aaaabc21d20, error_code: -1610612737
[E 10:10:22.035861] handle_io_error: flow proto 0x2aaaabc21d20 canceled 1 operations, will clean up.
[E 10:10:22.036661] handle_io_error: flow proto 0x2aaaabc21d20 error cleanup finished, error_code: -1610612737
dmesg shows this on every server:
ib_mthca 0000:01:00.0: modify QP 3->4 returned status 10.
ib_mthca 0000:01:00.0: modify QP 3->4 returned status 10.
ib_mthca 0000:01:00.0: modify QP 3->4 returned status 10.
ib_mthca 0000:01:00.0: modify QP 3->4 returned status 10.
ib_mthca 0000:01:00.0: modify QP 3->4 returned status 10.
Any ideas?
thanks
--Kyle
--
Kyle Schochenmaier
[EMAIL PROTECTED]
Research Assistant, Dr. Brett Bode
AmesLab - US Dept. of Energy
Scalable Computing Laboratory