Re: [freenet-dev] Fatal timeouts was: Roadmap to 0.7.0

Robert Hailey Fri, 18 Jan 2008 12:50:47 -0800


On Jan 18, 2008, at 12:55 PM, Matthew Toseland wrote:

I think that the fatal timeout problem should be nailed down before0.7.


From my investigation I think that it is caused by a problem in the
link layer (Message->packet->Message) where Messages are either
dropped or duplicated (albeit, somewhat rarely).


Do you have any more detail?

Definitionally a fatal timeout is when the last node receives the Ackthat the other node is working on it, but nothing else (except maybenon-local RejectOverloads).

So that we received an ack, factually all we *know* is that a requesthandler began to run on the remote machine, but we did not get aresponding MessageItem...In theory these are the only possible problems that would cause thisin an environment where all the nodes are the same (simulator):

(1) RequestHandler & Sender don't match each other.

(2) The node finished, but the response was dropped between theirRequestHandler and our MessageCore.(3) The node actually did take > 2 minutes to process the request(e.g. fetching from datastore, verify, forwarding queue, coalescingdelays, etc).

Although I have just recently been looking at the HTLs w/respect tofatal timeouts, I find it super-odd that (at least for this, mysimulator setup) they always occur at htl==5 (?!?!). How totallybizarre! HTLs do not start at 5, so they are surely being decremented.It must be that the request stays at 10 to a point, back tracks to 5,and then fatal-timeout's.

As best I can tell, all the fatal timeouts in the simulator occur withdata which does not exist so it may end up being a whole-networksearch (a bunch of USK lookups).

Certainly if any control packet is dropped in the chain a timeout iscertain, and I have gotten a few accepted timeouts in the simulator,but they are much more rare.


--Impossible message #1--

Jan 18, 2008 19:43:22:148 (freenet.node.PeerNode$SyncMessageCallback,UDP packet receiver for Darknet port 5008, ERROR): Acknowledged butnot sent?! on [EMAIL PROTECTED] forfreenet[EMAIL PROTECTED]@90.1.1.242:[EMAIL PROTECTED]@12ca580


* To me this indicates a bug in the message system.

* If we send a Message we don't think that we have sent, this mightmake all the sendSync()'s contributers to cause #3.* It does not appear consistently with fatal timeouts, so at best itis a symptom, not a cause.


--Impossible message #2--

Jan 18, 2008 20:21:39:517 (freenet.node.RequestSender, RequestSenderfor UID -8258496701911223604, ERROR): request fatal-timeout (null)htl#5 after accept (1 messages; last=FNPRejectOverload)Jan 18, 2008 20:21:40:926 (freenet.node.RequestSender, RequestSenderfor UID -8258496701911223604, ERROR): request fatal-timeout (null)htl#5 after accept (1 messages; last=FNPRejectOverload)Jan 18, 2008 20:21:40:927 (freenet.node.RequestHandler, RequestHandlerfor UID -8258496701911223604, ERROR): requestsender took too long torespond to requestor (2m1s/6)Jan 18, 2008 20:21:41:030 (freenet.node.RequestSender, RequestSenderfor UID -8258496701911223604, ERROR): request fatal-timeout (null)htl#5 after accept (1 messages; last=FNPRejectOverload)Jan 18, 2008 20:21:41:031 (freenet.node.RequestHandler, RequestHandlerfor UID -8258496701911223604, ERROR): requestsender took too long torespond to requestor (2m1s/6)Jan 18, 2008 20:21:41:031 (freenet.node.RequestHandler, RequestHandlerfor UID -3246664023200145547, ERROR): requestsender took too long torespond to requestor (2m/6)Jan 18, 2008 20:21:46:211 (freenet.node.RequestHandler, RequestHandlerfor UID -8258496701911223604, ERROR): requestsender took too long torespond to requestor (2m6s/1)

* I would expect that if one node FATAL_TIMEOUTs, the whole chainshould stop (and all return TIMEOUT "/6", but the last one would havereturned RNF "/1"). I suppose this means that some requestsenderwaited *longer* than FETCH_TIMEOUT, and a request handler translated atimeout to a RejectOverload, and then a node waiting even longer thanFETCH_TIMEOUT tried to continue searching, but running out of peersreturned RNF (???) [a whole 6 seconds more too].* I would expect that fatal timeouts are cascading and thereforeshould see: fatal-timeout htl#5, htl#6, htl#7... etc. Not all #5.

--

Dropped Messages makes sense, as the occasional AcceptedTimeout (whichin a simulator should also be impossible).

If the issue is duplicated packets, then it would have to be coupledwith poor mutual exclusion on request senders (two running with thesame UID, taking each others packets), at least a winnable racecondition.


That's all I've got so far.

--
Robert Hailey

_______________________________________________
Devl mailing list
[email protected]
http://emu.freenetproject.org/cgi-bin/mailman/listinfo/devl

Re: [freenet-dev] Fatal timeouts was: Roadmap to 0.7.0

Reply via email to