Pete Wyckoff wrote:
[EMAIL PROTECTED] wrote on Mon, 18 Sep 2006 08:08 -0500:
Doh, I thought I had a log included here; I guess not. Here's the log -- note the second 'closing connection' message?
[..]
I'm not sure why we're getting two 'closing connection to' messages -- indeed, now that I think of it, this may be our problem: something may be trying to close the connection twice -- but I can tell you that I always see two of these messages.

I'm also still seeing weird error codes, and I'm not sure whether we've addressed those yet, but I doubt that's our real problem.

Thanks for the log.  It did point exactly to the problem.

I was just looking at this last night and found the (rather silly)
bug and checked in a fix.  So at least my test case to crash the
server does not happen anymore.  I will continue to try to induce other
crashes by killing one side or the other.
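
For what it's worth, the usual way to make a double close harmless is to guard
the teardown so the second call is a no-op.  Here's a minimal sketch, assuming
the crash really was a double teardown of the same connection -- the struct and
function names below are made up, and this is not the actual BMI IB code or the
fix that was checked in:

/* Idempotent close: the second call does nothing. */
#include <stdlib.h>

struct ib_connection {
    int   closed;   /* set once the connection has been torn down */
    void *qp;       /* transport state freed on close */
};

static void close_connection(struct ib_connection *c)
{
    if (c == NULL || c->closed)
        return;                 /* already closed -- harmless no-op */
    c->closed = 1;

    free(c->qp);                /* tear down transport state exactly once */
    c->qp = NULL;
}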

If you can help me find a way to cause the hang you're getting with
NPpvfs, I'd like to get that resolved.  Or give me a primer offline
if you think I should take a look at it on your machines.

                -- Pete



Pete -
I ran the modified netpipe tests you sent back to me last week, and this is what I have from the log of the MD server; the other servers don't seem to have any recollection of this issue. And yes, it locked up over here for some reason. Also, thanks to your fix for the close_connection problem, my MD server (and the other servers, for that matter) no longer crashes on these tests! We're still hanging on something here, though. I ran on regular 4x links for this test, same setup as mentioned in the other thread: 6 servers, 1 MD server, and 1 client using a 4x link.

Also, for some reason or another, the servers aren't spinning at 100% anymore after these tests. I'll look into getting you an account, or making sure your old account is still valid here, this afternoon.


[D 10:32:45.319237] ib_check_cq: send to 10.1.5.218:54332 completed locally.
[D 10:32:45.319245] dbpf_bstream_rw_list: mem_offset: 0x63afd0, mem_size: 99
[D 10:32:45.319257] dbpf_bstream_rw_list: stream_offset: 0, stream_size: 99
[D 10:32:45.319267] DBPF I/O ops in progress: 1
[D 10:32:45.319276] lio_listio called with the following aiocbs:
[D 10:32:45.319290] aiocb_ptr_array[0]: fd: 13, off: 0, bytes: 99, buf: 0x63afd0, type: 0
[D 10:32:45.319301] issue_or_delay_io_operation: lio_listio posted 0x63d8b0 (handle 1840700272, ret 0)
[D 10:32:45.319336] --- aio_progress_notification called with handle 1840700272 (0x63d8b0)
[D 10:32:45.319346] aio_progress_notification: READ complete: aio_return() says 99 [fd = 13]
[D 10:32:45.319355] *** starting delayed ops if any (state is LIST_PROC_ALLPOSTED)
[D 10:32:45.319387] DBPF I/O ops in progress: 0
[D 10:32:45.319415] BMI_post_send_list: addr: 6246, count: 1, total_size: 139, tag: 32780
[D 10:32:45.319424]    element 0: offset: 0x64ada0, size: 139
[D 10:32:45.319433] BMI_ib_post_send_list: listlen 1 tag 32780.
[E 10:33:15.509651] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 235669.
[D 10:33:15.522650] BMI_cancel: cancel id 235670
[D 10:33:15.522731] test_sq: sq 0x644670 cancelled.
[D 10:33:15.522779] BMI_testcontext completing: 235670
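
For anyone following the log: the [E] line is the job time manager giving up on
the posted BMI send roughly 30 seconds after it was posted (10:32:45 to
10:33:15) and cancelling it.  Below is a minimal sketch of that
timeout-then-cancel pattern -- all names here are hypothetical, not the actual
pvfs2 job_time_mgr or BMI interfaces:

#include <stdio.h>
#include <time.h>

#define JOB_TIMEOUT_SECS 30

struct job {
    int    id;
    time_t posted;      /* when the BMI operation was posted */
    int    completed;   /* set by the normal completion path */
};

static void bmi_cancel(struct job *j)
{
    /* Force the stuck operation to complete with an error so the
     * state machine can make progress instead of waiting forever. */
    printf("job time out: cancelling bmi operation, job_id: %d\n", j->id);
    j->completed = 1;
}

static void job_timeout_scan(struct job *jobs, int njobs, time_t now)
{
    int i;
    for (i = 0; i < njobs; i++)
        if (!jobs[i].completed && now - jobs[i].posted >= JOB_TIMEOUT_SECS)
            bmi_cancel(&jobs[i]);
}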

   -- Kyle

--
Kyle Schochenmaier
[EMAIL PROTECTED]
Research Assistant, Dr. Brett Bode
AmesLab - US Dept. Energy
Scalable Computing Laboratory
