[EMAIL PROTECTED] wrote on Mon, 18 Sep 2006 10:42 -0500:
> I ran the modified netpipe tests you sent me last week, and this is
> what I have from the log of the MD server; the other servers don't
> seem to have any recollection of this issue. And yes, it locked up
> over here for some reason. Also, thanks to your fix for the
> close_connection problem, my MD server, and the other servers for
> that matter, no longer crash on these tests! We're still hanging on
> something here, though. I ran on regular 4x links for this test, same
> setup as mentioned in the other thread: 6 servers, 1 MD, 1 client,
> using a 4x link.
Some progress. :)
> Also, for some reason or another, the servers aren't spinning at
> 100% anymore after these tests.
> I'll look into getting you an account, or making sure your old account
> is still valid here this afternoon.
I just changed my password last week, as demanded by some automatic
mailer on gateway.
> [D 10:32:45.319415] BMI_post_send_list: addr: 6246, count: 1, total_size:
> 139, tag: 32780
> [D 10:32:45.319424] element 0: offset: 0x64ada0, size: 139
> [D 10:32:45.319433] BMI_ib_post_send_list: listlen 1 tag 32780.
> [E 10:33:15.509651] job_time_mgr_expire: job time out: cancelling bmi
> operation, job_id: 235669.
The message isn't getting out of the box. The very next line that
should appear after BMI_ib_post_send_list for this level of
debugging would look something like:
[D 13:23:32.742349] ib_check_cq: send to 10.100.2.55:34272 completed locally.
That happens when the sending side calls BMI_testsome, which
filters down to a device probe. There are a few things that could
be wrong, but I can't convince myself that any are very likely:
1. ibv_post_send failed. Would have seen a return error message or
found an asynchronous error later.
2. Network died. The message should have timed out and generated an
error.
3. Host refused the message. Again, an error should have propagated
back.
4. BMI thread function never polled the device. Hard to imagine an
error that major.
You might edit src/io/bmi/bmi_ib/ib.h, near the end, and change
DEBUG_LEVEL to 4. That will generate one more debug message
immediately before the message hits the wire. But I can't imagine
it goes bad there.
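For reference, the knob in question is just a macro, so the edit is a
one-liner; the exact current value is a guess here, so check what your
tree actually has near the end of that file:

```c
/* src/io/bmi/bmi_ib/ib.h, near the end */
#define DEBUG_LEVEL 4   /* raised so the step just before the wire is logged */
```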
Maybe run things under a debugger with a breakpoint on
job_time_mgr_expire. When you hit it, look at the logs and verify
this is the same situation, then look at the sendq and see if
there's anything funny about the message:
p &ib_device->sendq
p ib_device->sendq
p *(ib_send_t *) ib_device->sendq.next
(assuming there's only one). Maybe try to poll the network manually:
p ib_check_cq()
(See if the logs get another line, and if the return value is > 0.)
-- Pete
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers