Pete Wyckoff wrote:
[EMAIL PROTECTED] wrote on Mon, 18 Sep 2006 08:08 -0500:
Doh, I thought I had a log included here; I guess not. Here's the log -- note the second 'closing connection' message?
[..]
I'm not sure why we're getting two 'closing connection to' messages -- indeed, now that I think of it, this may be our problem: something may be trying to close the connection twice -- but I can tell you that I always see two of these messages.

I'm also still seeing weird error codes, and I'm not sure whether we've addressed those yet, but I doubt that's our real problem.

Thanks for the log.  It did point exactly to the problem.

I was just looking at this last night and found the (rather silly)
bug and checked in a fix.  So at least my test case to crash the
server does not happen anymore.  I will continue to try to induce other
crashes by killing one side or the other.
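
For what it's worth, the usual way to make a double close harmless is to guard
the teardown so the second call is a no-op.  Here's a minimal sketch, assuming
the crash really was a double teardown of the same connection -- the struct and
function names below are made up, and this is not the actual BMI IB code or the
fix that was checked in:

/* Idempotent close: the second call does nothing. */
#include <stdlib.h>

struct ib_connection {
    int   closed;   /* set once the connection has been torn down */
    void *qp;       /* transport state freed on close */
};

static void close_connection(struct ib_connection *c)
{
    if (c == NULL || c->closed)
        return;                 /* already closed -- harmless no-op */
    c->closed = 1;

    free(c->qp);                /* tear down transport state exactly once */
    c->qp = NULL;
}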

If you can help me find a way to cause the hang you're getting with
NPpvfs, I'd like to get that resolved.  Or give me a primer offline
if you think I should take a look at it on your machines.

                -- Pete



Pete -
I ran the modified netpipe tests you sent back to me last week, and this is what I have from the log of the MD server; the other servers don't seem to have any recollection of this issue. And yes, it locked up over here for some reason. Also, thanks to your fix for the close_connection problem, my MD server (and the other servers, for that matter) no longer crashes on these tests! We're still hanging on something here, though. I ran on regular 4x links for this test, same setup as mentioned in the other thread: 6 servers, 1 MD server, and 1 client using a 4x link.

Also, for some reason or another, the servers aren't spinning at 100% anymore after these tests. I'll look into getting you an account, or making sure your old account is still valid here, this afternoon.


[D 10:32:45.319237] ib_check_cq: send to 10.1.5.218:54332 completed locally.
[D 10:32:45.319245] dbpf_bstream_rw_list: mem_offset: 0x63afd0, mem_size: 99
[D 10:32:45.319257] dbpf_bstream_rw_list: stream_offset: 0, stream_size: 99
[D 10:32:45.319267] DBPF I/O ops in progress: 1
[D 10:32:45.319276] lio_listio called with the following aiocbs:
[D 10:32:45.319290] aiocb_ptr_array[0]: fd: 13, off: 0, bytes: 99, buf: 0x63afd0, type: 0
[D 10:32:45.319301] issue_or_delay_io_operation: lio_listio posted 0x63d8b0 (handle 1840700272, ret 0)
[D 10:32:45.319336] --- aio_progress_notification called with handle 1840700272 (0x63d8b0)
[D 10:32:45.319346] aio_progress_notification: READ complete: aio_return() says 99 [fd = 13]
[D 10:32:45.319355] *** starting delayed ops if any (state is LIST_PROC_ALLPOSTED)
[D 10:32:45.319387] DBPF I/O ops in progress: 0
[D 10:32:45.319415] BMI_post_send_list: addr: 6246, count: 1, total_size: 139, tag: 32780
[D 10:32:45.319424]    element 0: offset: 0x64ada0, size: 139
[D 10:32:45.319433] BMI_ib_post_send_list: listlen 1 tag 32780.
[E 10:33:15.509651] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 235669.
[D 10:33:15.522650] BMI_cancel: cancel id 235670
[D 10:33:15.522731] test_sq: sq 0x644670 cancelled.
[D 10:33:15.522779] BMI_testcontext completing: 235670
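
For anyone following the log: the [E] line is the job time manager giving up on
the posted BMI send roughly 30 seconds after it was posted (10:32:45 to
10:33:15) and cancelling it.  Below is a minimal sketch of that
timeout-then-cancel pattern -- all names here are hypothetical, not the actual
pvfs2 job_time_mgr or BMI interfaces:

#include <stdio.h>
#include <time.h>

#define JOB_TIMEOUT_SECS 30

struct job {
    int    id;
    time_t posted;      /* when the BMI operation was posted */
    int    completed;   /* set by the normal completion path */
};

static void bmi_cancel(struct job *j)
{
    /* Force the stuck operation to complete with an error so the
     * state machine can make progress instead of waiting forever. */
    printf("job time out: cancelling bmi operation, job_id: %d\n", j->id);
    j->completed = 1;
}

static void job_timeout_scan(struct job *jobs, int njobs, time_t now)
{
    int i;
    for (i = 0; i < njobs; i++)
        if (!jobs[i].completed && now - jobs[i].posted >= JOB_TIMEOUT_SECS)
            bmi_cancel(&jobs[i]);
}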

   -- Kyle

--
Kyle Schochenmaier
[EMAIL PROTECTED]
Research Assistant, Dr. Brett Bode
AmesLab - US Dept. Energy
Scalable Computing Laboratory
