I just wanted to move a discussion we had on IRC to the list by putting
out what I have now come to believe about the problem after putting it
through the Oskar filter.

We have problems with the way 0.4 nodes behave when they become
overloaded - they tend to become unreachable to new connections and slow
down to a crawl, but without this having the effect of lowering the
load. After thinking about it, I think this is because the only way the
node responds to high load is completely out of line with the 0.4
protocol.

The current system is pretty much the same as the old one - when a node
becomes overloaded (which almost always manifests itself in an absolute
way by the threadpool filling up) it stops accepting incoming
connections. This worked decently in 0.3 and below, because then every
new query would arrive on a new connection, and replies would almost
always come over existing ones. This correlation between the
transport/session handling (the connections) and the Application level
protocol meant that the actual effect of blocking a transport layer
connection was more or less equivalent to rejecting further requests.

With 0.4 we removed this correlation between the layers: messages are
now sent between nodes on whatever open connections are available or
made, and queries are just as likely to arrive on existing connections
(incoming or outgoing) as on new incoming ones. We didn't, however,
implement any new overload handling beyond the rejection of new
incoming connections.

It became clear quite soon that people's 0.4 nodes were keeping all
their connections open all the time. I took this to mean that the
algorithms for closing idle connections needed to be tuned - that most
of the connections locking up threads were simply redundant and rarely
used. However, much stricter rules and the addition of connection
pruning (which kills off all connections that are even slightly old
when the node runs out of threads) haven't seemed to help at all.

I am now coming to a different conclusion - the problem is not unused
connections but used ones. The combination of datastore problems leading
to very few stable nodes, and the widespread use of God-awful network
flooding applications like "Frost", means that the nodes from which I
have seen logs and stats actually are overloaded - and because the
correlation between new connections and new queries no longer exists,
our transport/session layer response to overloading has very little
effect on the Application layer problem of query overloading.

I now believe that the correct approach to this problem should be:

- When the load on the node starts approaching the limit of what it can
take, it should start rejecting queries at the Application layer by
replying to (Data/Insert/Announcement)Request messages with
QueryRejected. This should have the same effect that rejecting incoming
connections had in 0.3.
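
As a rough illustration of the first point, the rejection could look
something like the sketch below. The message names come from the 0.4
protocol, but the threshold, counters, and function names are all my own
assumptions, not actual node code:

```python
# Hypothetical sketch of Application-layer load rejection. The message
# names match the 0.4 protocol; everything else (the 0.9 threshold, the
# thread counters, the function name) is illustrative.

QUERY_TYPES = {"DataRequest", "InsertRequest", "AnnouncementRequest"}

def handle_message(msg_type, active_threads, max_threads,
                   reject_fraction=0.9):
    """Reply QueryRejected to new queries once the threadpool nears
    full, instead of refusing new connections at the transport layer."""
    overloaded = active_threads >= reject_fraction * max_threads
    if msg_type in QUERY_TYPES and overloaded:
        return "QueryRejected"
    return "Accepted"
```

Note that non-query messages (replies, for instance) are still handled
even under load, which is exactly what blocking connections in 0.4
fails to do.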

- We want to avoid rejecting incoming connections whenever possible,
because in 0.4 this causes us to fail to get replies as often as it
causes us to accept fewer queries. Thus, if we get close to running out
of threads entirely, we should start dropping existing connections to
make way for new ones. I now think that the whole pruning algorithm is
unnecessary; this should simply be done one for one. It seems logical
to me that the connection we drop should be the least recently used
one, but GJ believes this is bad, through some arguments about
attackers that make no sense to me; if it matters to others, then
dropping a random currently idle connection probably works as well.
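
A minimal sketch of that one-for-one dropping, supporting both the LRU
choice and the random-idle alternative (the data layout and names here
are my own assumptions, not node code):

```python
import random

def drop_one_connection(connections, strategy="lru"):
    """Pick one existing connection to close so a new incoming one can
    be accepted. `connections` is a list of dicts with hypothetical
    fields: 'id', 'last_used' (a timestamp), and 'idle' (bool)."""
    if strategy == "lru":
        # Drop the least recently used connection.
        victim = min(connections, key=lambda c: c["last_used"])
    else:
        # Or drop a random currently idle one, if LRU is thought unsafe;
        # fall back to any connection if none are idle.
        idle = [c for c in connections if c["idle"]] or connections
        victim = random.choice(idle)
    connections.remove(victim)
    return victim["id"]
```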

It should be noted that this second behavior will have absolutely no
detrimental effect on the load - that is entirely moved into the
Application layer where it belongs.

Beyond this lies the use of real load balancing in the protocol, by
having nodes weigh the probability of resetting the DataSource against
their relative load on the network - but I think we can leave that for
another day.
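
For what it's worth, that weighting could be as simple as scaling a
baseline reset probability by relative load - purely a sketch, with a
made-up base rate and a made-up inverse-linear formula:

```python
def datasource_reset_probability(my_load, network_avg_load,
                                 base_p=0.05):
    """Sketch only: lower the chance that an overloaded node resets the
    DataSource (and so attracts future requests for the data). Both the
    base rate and the scaling are invented for illustration."""
    if my_load <= 0:
        return base_p
    return min(1.0, base_p * network_avg_load / my_load)
```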


-- 

Oskar Sandberg
oskar at freenetproject.org

_______________________________________________
Devl mailing list
Devl at freenetproject.org
http://lists.freenetproject.org/mailman/listinfo/devl
