On Jan 18, 2008, at 12:55 PM, Matthew Toseland wrote:
I think that the fatal timeout problem should be nailed down before
0.7.
From my investigation I think that it is caused by a problem in the
link layer (Message->packet->Message) where Messages are either
dropped or duplicated (albeit, somewhat rarely).
Do you have any more detail?
Definitionally a fatal timeout is when the last node receives the Ack
that the other node is working on it, but nothing else (except maybe
non-local RejectOverloads).
So that we received an ack, factually all we *know* is that a request
handler began to run on the remote machine, but we did not get a
responding MessageItem...
In theory these are the only possible problems that would cause this
in an environment where all the nodes are the same (simulator):
(1) RequestHandler & Sender don't match each other.
(2) The node finished, but the response was dropped between their
RequestHandler and our MessageCore.
(3) The node actually did take > 2 minutes to process the request
(e.g. fetching from datastore, verify, forwarding queue, coalescing
delays, etc).
Although I have just recently been looking at the HTLs w/respect to
fatal timeouts, I find it super-odd that (at least for this, my
simulator setup) they always occur at htl==5 (?!?!). How totally
bizarre! HTLs do not start at 5, so they are surely being decremented.
It must be that the request stays at 10 to a point, back tracks to 5,
and then fatal-timeout's.
As best I can tell, all the fatal timeouts in the simulator occur with
data which does not exist so it may end up being a whole-network
search (a bunch of USK lookups).
Certainly if any control packet is dropped in the chain a timeout is
certain, and I have gotten a few accepted timeouts in the simulator,
but they are much more rare.
--Impossible message #1--
Jan 18, 2008 19:43:22:148 (freenet.node.PeerNode$SyncMessageCallback,
UDP packet receiver for Darknet port 5008, ERROR): Acknowledged but
not sent?! on [EMAIL PROTECTED] for
freenet
[EMAIL PROTECTED]@90.1.1.242:[EMAIL PROTECTED]@12ca580
* To me this indicates a bug in the message system.
* If we send a Message we don't think that we have sent, this might
make all the sendSync()'s contributers to cause #3.
* It does not appear consistently with fatal timeouts, so at best it
is a symptom, not a cause.
--Impossible message #2--
Jan 18, 2008 20:21:39:517 (freenet.node.RequestSender, RequestSender
for UID -8258496701911223604, ERROR): request fatal-timeout (null)
htl#5 after accept (1 messages; last=FNPRejectOverload)
Jan 18, 2008 20:21:40:926 (freenet.node.RequestSender, RequestSender
for UID -8258496701911223604, ERROR): request fatal-timeout (null)
htl#5 after accept (1 messages; last=FNPRejectOverload)
Jan 18, 2008 20:21:40:927 (freenet.node.RequestHandler, RequestHandler
for UID -8258496701911223604, ERROR): requestsender took too long to
respond to requestor (2m1s/6)
Jan 18, 2008 20:21:41:030 (freenet.node.RequestSender, RequestSender
for UID -8258496701911223604, ERROR): request fatal-timeout (null)
htl#5 after accept (1 messages; last=FNPRejectOverload)
Jan 18, 2008 20:21:41:031 (freenet.node.RequestHandler, RequestHandler
for UID -8258496701911223604, ERROR): requestsender took too long to
respond to requestor (2m1s/6)
Jan 18, 2008 20:21:41:031 (freenet.node.RequestHandler, RequestHandler
for UID -3246664023200145547, ERROR): requestsender took too long to
respond to requestor (2m/6)
Jan 18, 2008 20:21:46:211 (freenet.node.RequestHandler, RequestHandler
for UID -8258496701911223604, ERROR): requestsender took too long to
respond to requestor (2m6s/1)
* I would expect that if one node FATAL_TIMEOUTs, the whole chain
should stop (and all return TIMEOUT "/6", but the last one would have
returned RNF "/1"). I suppose this means that some requestsender
waited *longer* than FETCH_TIMEOUT, and a request handler translated a
timeout to a RejectOverload, and then a node waiting even longer than
FETCH_TIMEOUT tried to continue searching, but running out of peers
returned RNF (???) [a whole 6 seconds more too].
* I would expect that fatal timeouts are cascading and therefore
should see: fatal-timeout htl#5, htl#6, htl#7... etc. Not all #5.
--
Dropped Messages makes sense, as the occasional AcceptedTimeout (which
in a simulator should also be impossible).
If the issue is duplicated packets, then it would have to be coupled
with poor mutual exclusion on request senders (two running with the
same UID, taking each others packets), at least a winnable race
condition.
That's all I've got so far.
--
Robert Hailey
_______________________________________________
Devl mailing list
[email protected]
http://emu.freenetproject.org/cgi-bin/mailman/listinfo/devl