Pete Wyckoff wrote:

[EMAIL PROTECTED] wrote on Thu, 02 Mar 2006 18:28 -0600:
[EMAIL PROTECTED]:~/pvfs2$ inst1.4/bin/pvfs2-cp -t /tmp/bigfile3 /mnt/pvfs2/
[E 17:58:03.909494] Error: encourage_send_incoming_cts: mop_id 1962 in CTS message not found.

No clue.  I too have seen something similar that looks like some
sort of a race, although with different errors in the CTS path.  Do
what you can to get a good debugging trace (PVFS2_DEBUGMASK=network)
on client and server and we'll figure it out.  I assume this happens on
the openib version too?  The bit of time I spent looking at it, I
found the problem goes away when debugging is on, how fun.  I'm not
sure what changed to cause this to happen, perhaps some optimization
higher up (like immediate return from sends?) is causing BMI_IB
functions to be called more quickly.

<snip>
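For reference, enabling the network trace suggested above can be done by setting the debug mask in the environment of the client command; this is a sketch based on the command used in this thread, and the log file path is just a placeholder:

```shell
# Turn on the network debug mask for the client invocation and capture
# stderr to a file; the server would be started with the same mask set
# in its environment. The output path here is a placeholder.
PVFS2_DEBUGMASK=network \
    inst1.4/bin/pvfs2-cp -t /tmp/bigfile3 /mnt/pvfs2/ 2> /tmp/pvfs2-client-debug.log
```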

I've gone through using some of the debugging features and have seen some disturbingly positive results :) With debugging enabled, 512MB transfers no longer fail as they did previously.
However, larger, more realistic tests still exhibit the same problem.
I've attached the output of a 6.4GByte run that failed, and roughly estimated that it failed after transmitting about 650MBytes. (This assumes that about 90% of the 2800 completions in this run each moved 256KB, though my estimate may be pointless and/or plain wrong considering the problem at hand.) This type of test is 100% reproducible on my end, varying between runs only in which mop_id fails.
The test is:
'pvfs2-cp -t [6.4GB file on local array] [pvfs2-fs on remote server mounted to 2TB array]'
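The back-of-envelope figure above can be reproduced directly; this is just arithmetic on the numbers stated in the thread (2800 completions, ~90% assumed complete, 256KB per completion):

```python
# Back-of-envelope check of the "about 650 MB before failure" estimate:
# assume roughly 90% of the 2800 completions in the log each moved a
# full 256 KiB buffer (numbers taken from the run described above).
completions = 2800
completed_fraction = 0.9
chunk_bytes = 256 * 1024

estimated_bytes = int(completions * completed_fraction) * chunk_bytes
print(estimated_bytes / 2**20)  # about 630 MiB, consistent with ~650 MB
```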

When going through the output, I noticed several cases where the completions generated for a BMI context's mop_ids were off by 5-10. This seemed to correct itself and/or be handled by the server without issue; however, toward the end there is a case where the ids get off by 35+ and the transfer fails with a CTS error.
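The check described above can be sketched as a small standalone scan (this is not PVFS2 code; the sample ids below are made up, and real input would come from pulling mop_id values out of the attached debug log):

```python
# Sketch of the analysis: scan a sequence of completion mop_ids and flag
# places where consecutive ids jump by more than a threshold, as seen in
# the 5-10 (benign) and 35+ (fatal) cases described above.
def find_jumps(ids, threshold):
    """Yield (index, prev, cur) wherever cur - prev exceeds threshold."""
    for i in range(1, len(ids)):
        if ids[i] - ids[i - 1] > threshold:
            yield i, ids[i - 1], ids[i]

sample = [1900, 1901, 1902, 1910, 1911, 1950]  # hypothetical mop_ids
print(list(find_jumps(sample, 5)))  # flags the 1902->1910 and 1911->1950 gaps
```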

I'm not entirely sure why a larger offset in the generated completion ids would break the server, but I have a distinct feeling that this may be the cause.

Any ideas?

thanks,

   - Kyle


--
Kyle Schochenmaier
[EMAIL PROTECTED]
Research Assistant, Dr. Brett Bode
AmesLab - US Dept.Energy
Scalable Computing Laboratory

Attachment: pvfs-cp-busted.log.gz
Description: GNU Zip compressed data

_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers