Re: [freenet-dev] How to gather more data was Re: Beyond New Load Management: A proposal
On Thursday 01 Sep 2011 15:29:35 Evan Daniel wrote: On Thu, Sep 1, 2011 at 7:24 AM, Matthew Toseland t...@amphibian.dyndns.org wrote: I like this proposal :)

Is the documentation on the math of how to get the random routing to behave well sufficient? Let me know if it isn't. The MHMC routing math shouldn't be too complicated, but we want to be certain it's implemented correctly so that the data is sound.

Do you have a metric for how clustered vs uniform a node's peers are?

Maybe. It's a tougher problem than it looks at first glance, unless we have a somewhat reliable network size estimate available. I'll give it some more thought. If you want a qualitative, visual estimate, just change the peer location distribution graph on the stats page to have a logarithmic x axis. Basically, a node should have similar numbers of peers at distance 0.25 <= d < 0.5 as at 0.125 <= d < 0.25, etc. That is, bar n should graph the number of nodes at distance 2^(-n-2) <= d < 2^(-n-1). That doesn't provide an answer you can use in code to evaluate and make decisions, but it should give better anecdotal evidence about problems.

So you/operhiem1 plan to do something with this now we have probe requests?

What's MHMC?

Metropolis-Hastings Monte Carlo. We currently use it to get the right probability distribution for location swaps.

Do we? I thought the walks were purely random, and the probability of swapping was based solely on a comparison between the distances before and after?

We should also use it for randomly routed requests. Indeed, this is now implemented. (For ones that need to be statistically random, anyway. It's probably not worth the performance cost for e.g. high-HTL random routing of regular requests for security purposes.)
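The MHMC walk being discussed can be illustrated with the standard Metropolis-Hastings correction for a random walk on a graph (a minimal sketch in my own naming, not the snippet from the flog or Freenet's actual code): each step proposes a uniformly chosen peer and accepts the move with probability min(1, deg(current)/deg(proposed)), which makes the walk's stationary distribution uniform over nodes even though peer counts vary.

```python
import random

def mh_walk(graph, start, steps, rng=None):
    """Random walk whose stationary distribution is uniform over nodes.

    graph maps node -> list of neighbours (undirected).  A plain random
    walk oversamples high-degree nodes; the Metropolis-Hastings
    acceptance rule min(1, deg(u)/deg(v)) cancels that bias, at the
    price of sometimes staying put for a step.
    """
    rng = rng or random.Random()
    node = start
    for _ in range(steps):
        proposed = rng.choice(graph[node])
        accept = min(1.0, len(graph[node]) / len(graph[proposed]))
        if rng.random() < accept:
            node = proposed
    return node
```

On a star topology a plain walk would end on the hub about half the time; with the correction every node is sampled roughly equally, which is what makes single-value probe statistics trustworthy.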
It's mentioned in the bug report: https://bugs.freenetproject.org/view.php?id=3568 And there's a code snippet on my flog, that's used in my simulator: freenet:USK@gjw6StjZOZ4OAG-pqOxIp5Nk11udQZOrozD4jld42Ac,BYyqgAtc9p0JGbJ~18XU6mtO9ChnBZdf~ttCn48FV7s,AQACAAE/flog/29/200909.xhtml (Entry for 20090926)

Evan Daniel

___ Devl mailing list Devl@freenetproject.org https://emu.freenetproject.org/cgi-bin/mailman/listinfo/devl
Re: [freenet-dev] How to gather more data was Re: Beyond New Load Management: A proposal
On Thu, Aug 2, 2012 at 7:18 AM, Matthew Toseland t...@amphibian.dyndns.org wrote: On Thursday 01 Sep 2011 15:29:35 Evan Daniel wrote: On Thu, Sep 1, 2011 at 7:24 AM, Matthew Toseland t...@amphibian.dyndns.org wrote: I like this proposal :)

Is the documentation on the math of how to get the random routing to behave well sufficient? Let me know if it isn't. The MHMC routing math shouldn't be too complicated, but we want to be certain it's implemented correctly so that the data is sound.

Do you have a metric for how clustered vs uniform a node's peers are?

Maybe. It's a tougher problem than it looks at first glance, unless we have a somewhat reliable network size estimate available. I'll give it some more thought. If you want a qualitative, visual estimate, just change the peer location distribution graph on the stats page to have a logarithmic x axis. Basically, a node should have similar numbers of peers at distance 0.25 <= d < 0.5 as at 0.125 <= d < 0.25, etc. That is, bar n should graph the number of nodes at distance 2^(-n-2) <= d < 2^(-n-1). That doesn't provide an answer you can use in code to evaluate and make decisions, but it should give better anecdotal evidence about problems.

So you/operhiem1 plan to do something with this now we have probe requests?

Yes, though probably not immediately.

What's MHMC?

Metropolis-Hastings Monte Carlo. We currently use it to get the right probability distribution for location swaps.

Do we? I thought the walks were purely random, and the probability of swapping was based solely on a comparison between the distances before and after?

Absolutely. It's the mathematical technique used to prove that both the path folding algorithm of opennet and the location swapping algorithm of darknet will produce the desired results. The fact that the formulas used in those proofs don't appear in the Freenet source code is irrelevant, except that it contributes to Freenet code being difficult to understand.
Evan Daniel
Re: [freenet-dev] How to gather more data was Re: Beyond New Load Management: A proposal
Hi,

At Sat, 03 Sep 2011 00:53:58 +0200, Arne Babenhauserheide wrote: On Friday, 2 September 2011, 23:34:29, Matthew Toseland wrote:

If the load balancer does not have some hidden delicacies, there is a very simple check to see if my understanding is right.

No, it takes into account that SSKs use very little bandwidth (1-2K versus 32K).

Bandwidth: 75 down, 85 up.

I changed the upload bandwidth to 120 out (about my real output bandwidth), and it adjusted itself to 70 down and 78 up. This is about 10-20% more than with a 90 kB/s limit, but far from using my 120 (it only uses 65%). It is less than the overshooting one, though → the load limiter is a bit too strong. I assume that due to treating CHKs and SSKs the same, the load limiting also increased the number of CHKs, though these really need the bandwidth.

We have a mean transfer time of about 45s-1min for a 32k block on my node (with OLM: failed CHK time minus successful time; with NLM it is likely a bit more complex). Now the search time of about 1 minute in NLM for unsuccessful downloads (successful ones take 1:24 min for me) will block IO. If we assume that a successful transfer still takes 45s, almost half the time is spent searching without any transfer → wasted bandwidth.

(Disclaimer: The following is just an idea. Treat any “should” as “it might be a good idea to” - which is much longer and harder to read, so I go with the should.)

I assume that the deeper issue than just failing requests is that waiting should not be accounted as bandwidth. A request should be accounted as 32 kB * estimated success probability, so failing requests are not counted (conceptually; the implementation is another thing). The limiter should then use a time-frame of about 2× the time to complete a request and try to fill that → since NLM has higher wait times, it also needs a bigger bandwidth limiter window. If it can get to 3 minutes for bulk, the bandwidth limiter window needs to be 6 minutes.
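The accounting rule proposed above can be sketched like this (purely illustrative: the class and method names are mine, and the constants come from the discussion, not from Freenet's actual limiter): each accepted request reserves transfer size × estimated success probability against a budget covering a window of twice the typical request completion time.

```python
class LoadLimiter:
    """Sketch of success-probability-weighted bandwidth accounting.

    A request reserves transfer_bytes * p_success, so requests that are
    unlikely to succeed (e.g. most SSKs) barely consume budget.  The
    budget covers a window of twice the mean request completion time,
    so NLM's longer wait times automatically widen the window.
    """

    CHK_BYTES = 32 * 1024  # one 32K block

    def __init__(self, bytes_per_second, mean_request_seconds):
        self.window_seconds = 2.0 * mean_request_seconds
        self.budget = bytes_per_second * self.window_seconds
        self.reserved = 0.0

    def try_accept(self, p_success, transfer_bytes=CHK_BYTES):
        """Accept the request if its expected bandwidth still fits."""
        cost = transfer_bytes * p_success
        if self.reserved + cost > self.budget:
            return False
        self.reserved += cost
        return True

    def on_finished(self, p_success, transfer_bytes=CHK_BYTES):
        """Release the reservation when the request completes or fails."""
        self.reserved = max(0.0, self.reserved - transfer_bytes * p_success)
```

For example, at 10 KiB/s output and a 1-minute mean request time the window is 120 s, so roughly 37 full-probability CHK reservations fit, while low-probability SSK requests consume almost nothing of the budget.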
… one more reason (with the assumption of a 2 min limiter window) why NLM used so little bandwidth: estimated 2 min search times with 1 min transfer time meant that essentially 2/3 of the allocated bandwidth was overbooking, because 2/3 of the transfers would not finish within the window, so their bandwidth reservation was carried on into the next period. This reduced the total number of running requests, which in turn increased the wait times (since fewer requests finished in a given time period).

Conclusion: NLM changes some basic assumptions about the load limiting. Because of that we need parameter tweaking to integrate it cleanly. Likely that can only be done in a real network. We already know that the network *does not break down* with NLM, so live tweaking is possible. Likely it also changes the assumptions for other parts, so these will need some tweaking, too. Toad spent years optimizing the parameters and helper parts to work well with the assumptions from OLM. It’s natural that NLM — which has slightly different characteristics — requires some optimization outside of its main algorithms, too.

Best wishes, Arne

Besides: Bad performance: if the inserter uses NLM and the downloader uses OLM, I think that might make it harder for the downloader to get the data (different routing). Worse: if some downloaders use NLM and some use OLM, the packets might take different paths (NLM has shorter paths), which is likely to affect caching negatively (it needs to cache along both paths).

Besides 2: Regin has the theory that the thread scheduler was the bottleneck, because his threads always hit the limit of 500 threads.
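One way to read the 2/3 figure above (my interpretation of the arithmetic, using the 2 min search / 1 min transfer numbers from the message) is as the share of a request's lifetime spent searching rather than transferring, i.e. reserved bandwidth that never carries data:

```python
def overbooked_fraction(search_seconds, transfer_seconds):
    """Share of a request's bandwidth reservation that is idle search
    time rather than actual transfer.  With 2 min search and 1 min
    transfer this is 2/3, matching the overbooking estimate above."""
    total = search_seconds + transfer_seconds
    return search_seconds / total
```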
Re: [freenet-dev] How to gather more data was Re: Beyond New Load Management: A proposal
On Wed, Aug 31, 2011 at 8:00 AM, Matthew Toseland t...@amphibian.dyndns.org wrote: WE NEED MORE DATA.

Well, my gut tells me that our existing scheme is likely too complicated to fix unless we are extremely fortuitous, however I'm happy to be wrong about that if others think that they have a good understanding of why we're having problems and how to fix them. So this is fine with me provided that it's not just data for data's sake, but *actionable* data. By this I mean that the data we collect should give us clear information about what is and isn't working, and hopefully tell us how to fix it. But a bunch of metrics we have no idea how to interpret won't get us anywhere.

Ian.

--
Ian Clarke
Founder, The Freenet Project
Email: i...@freenetproject.org
Re: [freenet-dev] How to gather more data was Re: Beyond New Load Management: A proposal
On Friday, 2 September 2011, 12:20:02, Ian Clarke wrote: On Wed, Aug 31, 2011 at 8:00 AM, Matthew Toseland t...@amphibian.dyndns.org wrote: WE NEED MORE DATA.

Well, my gut tells me that our existing scheme is likely too complicated to fix unless we are extremely fortuitous, however I'm happy to be wrong about that if others think that they have a good understanding of why we're having problems and how to fix them.

If the load balancer does not have some hidden delicacies, there is a very simple check to see if my understanding is right. Since SSKs are mostly unsuccessful and are about 50% of the requests, the bandwidth limiter essentially targets 50% of the bandwidth. Setting my bandwidth to about 150% of my actual bandwidth should make it guess my bandwidth more correctly, leaving 25% free for bursting¹.

Currently the mean bandwidth with NLM and AIMDs for me is about 50 kB/s on a setting of 90 kB/s outgoing. My line can handle about 120 kB/s outgoing. So I set the bandwidth setting to 180 kB/s. If I am right, Freenet should then consume about 90 kB/s on average. If it stays at 50-60, that’s likely a limitation of my peers → no useful data → the test would have to be done on a slower line or with more peers. If it goes down or I get very many timeouts, then I'm likely wrong.

It would be nice if some other people could replicate that. Note: I just disabled my testnet node to avoid skewing the data.

Best wishes, Arne
Re: [freenet-dev] How to gather more data was Re: Beyond New Load Management: A proposal
On Friday 02 Sep 2011 18:20:02 Ian Clarke wrote: On Wed, Aug 31, 2011 at 8:00 AM, Matthew Toseland t...@amphibian.dyndns.org wrote: WE NEED MORE DATA.

Well, my gut tells me that our existing scheme is likely too complicated to fix unless we are extremely fortuitous, however I'm happy to be wrong about that if others think that they have a good understanding of why we're having problems and how to fix them. So this is fine with me provided that it's not just data for data's sake, but *actionable* data. By this I mean that the data we collect should give us clear information about what is and isn't working, and hopefully tell us how to fix it.

As a matter of principle, empiricism requires experimental data. But it's true that sometimes the metrics which are easiest to gather, and most closely related to our expected usage (i.e. good benchmarks), are hardest to theorise about.

But a bunch of metrics we have no idea how to interpret won't get us anywhere.

Well, the following appear to be justifiable:
- We can detect whether the theory about network topology being messed up by local requests is correct. This is directly actionable: if a large proportion of nodes have this problem, we could try the "no path folding above HTL 16" rule. Of course, this might slow down local requests, and might even make it harder to bootstrap new nodes ...
- Throughput data ... I guess some of our existing tests probably give us this, e.g. the daily timings for insert/retrieve, which lately succeed but take rather a long time. I guess we probably have enough here, although arguably we should have a big-file inter-day test (current tests are small files for multiple days or big files for one day).
- Build to build comparisons: if we are going to deploy changes to the network we need to be able to compare one build to another. Otherwise, no matter how good our simulations (and odds are they won't be very good; getting the right level of accuracy is hard), we won't know whether it works in practice.
Re: [freenet-dev] How to gather more data was Re: Beyond New Load Management: A proposal
On Friday 02 Sep 2011 19:31:14 Arne Babenhauserheide wrote: On Friday, 2 September 2011, 12:20:02, Ian Clarke wrote: On Wed, Aug 31, 2011 at 8:00 AM, Matthew Toseland t...@amphibian.dyndns.org wrote: WE NEED MORE DATA.

Well, my gut tells me that our existing scheme is likely too complicated to fix unless we are extremely fortuitous, however I'm happy to be wrong about that if others think that they have a good understanding of why we're having problems and how to fix them.

If the load balancer does not have some hidden delicacies, there is a very simple check to see if my understanding is right. Since SSKs are mostly unsuccessful and are about 50% of the requests, the bandwidth limiter essentially targets 50% of the bandwidth.

No, it takes into account that SSKs use very little bandwidth (1-2K versus 32K).
Re: [freenet-dev] How to gather more data was Re: Beyond New Load Management: A proposal
On Friday 02 Sep 2011 18:20:02 Ian Clarke wrote: On Wed, Aug 31, 2011 at 8:00 AM, Matthew Toseland t...@amphibian.dyndns.org wrote: WE NEED MORE DATA.

Well, my gut tells me that our existing scheme is likely too complicated to fix unless we are extremely fortuitous, however I'm happy to be wrong about that if others think that they have a good understanding of why we're having problems and how to fix them.

And my gut tells me that scrapping 6 months' work sucks, especially when the current performance appears to be well below what it was when I started to deploy NLM, so we should try to salvage it, especially as anything new will probably take another 6 months to argue about, design, simulate, tweak and generally get right. Funny how gut feelings can be so subjective! ;)

So this is fine with me provided that it's not just data for data's sake, but *actionable* data. By this I mean that the data we collect should give us clear information about what is and isn't working, and hopefully tell us how to fix it. But a bunch of metrics we have no idea how to interpret won't get us anywhere.
Re: [freenet-dev] How to gather more data was Re: Beyond New Load Management: A proposal
On Friday, 2 September 2011, 23:34:29, Matthew Toseland wrote:

If the load balancer does not have some hidden delicacies, there is a very simple check to see if my understanding is right. Since SSKs are mostly unsuccessful and are about 50% of the requests, the bandwidth limiter essentially targets 50% of the bandwidth.

No, it takes into account that SSKs use very little bandwidth (1-2K versus 32K).

That might explain the data I see: Bandwidth: 75 down, 85 up. I expected 90 kB/s. Theory: my claimed bandwidth doubling seems to have overshot, which gave too many timeouts (that is also supported by the volatility of the bandwidth: oscillating rapidly between 50 and 100 - this is with AIMDs on).

Best wishes, Arne
Re: [freenet-dev] How to gather more data was Re: Beyond New Load Management: A proposal
On Wednesday 31 Aug 2011 15:05:51 Evan Daniel wrote: On Wed, Aug 31, 2011 at 9:00 AM, Matthew Toseland t...@amphibian.dyndns.org wrote: On Monday 29 Aug 2011 19:28:50 Ian Clarke wrote:

Ok, thinking about it, here is a proposal, or rather, the beginning of a proposal. I'm assuming that we get rid of NLM, fair sharing, and anything else intended to control load, and replace it with this. We will absolutely need to simulate this before we write a single line of code to deploy it.

The core idea is that a node will include a floating point number in response to any kind of request showing how close that node is to being overloaded. 0.0 would mean it's doing nothing, 1.0 would mean that it is completely saturated and must reject requests. Clearly the goal is to avoid getting anywhere near to 1.0.

A node tracks several things:
- What is the overall average load reported by responses this node has received
- What is the average load reported by responses this node has received, per remote node
- What is the average load reported by responses this node forwarded, per remote node

I think, given these metrics, we should be able to do the following:
- Limit our overall rate of initiating local requests based on the global average reported load
- Limit our rate of local requests based on the average load of the connection to the peer it would need to be forwarded to
- Detect when remote peers are abusing the system by disregarding load - as evidenced by a significantly higher average load in replies forwarded to them

Of course, lots of questions:
- How do we compute the averages? A decaying running average of some kind? What time period?
- How do we translate load into a desired rate of requests?
- What criteria indicate that a peer is abusing the system? What is the remedy?

This is basically control theory stuff, and we'll need robust answers to these questions before we proceed (ideally with a theoretical rather than experimental foundation).
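One standard answer to the "decaying running average" question is an exponential moving average. The sketch below (all names and the 0.2 threshold are illustrative assumptions, not a worked-out design) tracks the three averages listed above and flags a peer whose forwarded replies consistently report much higher load than the global average:

```python
class LoadTracker:
    """Exponentially decaying averages of reported load (0.0 - 1.0).

    alpha sets the memory: each sample moves the average by
    alpha * (sample - average), so a smaller alpha means a longer
    effective time window.
    """

    def __init__(self, alpha=0.05):
        self.alpha = alpha
        self.global_avg = 0.0
        self.received = {}   # peer -> avg load in responses from that peer
        self.forwarded = {}  # peer -> avg load in replies we forwarded to it

    def _mix(self, old, sample):
        return old + self.alpha * (sample - old)

    def on_response_received(self, peer, load):
        self.global_avg = self._mix(self.global_avg, load)
        self.received[peer] = self._mix(self.received.get(peer, load), load)

    def on_reply_forwarded(self, peer, load):
        self.forwarded[peer] = self._mix(self.forwarded.get(peer, load), load)

    def looks_abusive(self, peer, margin=0.2):
        """Heuristic: a peer that keeps sending requests although the
        replies we forward to it report far-above-average load may be
        disregarding load signals."""
        fwd = self.forwarded.get(peer)
        return fwd is not None and fwd - self.global_avg > margin
```

How alpha maps to a time period, and what margin counts as "significantly higher", are exactly the open control-theory questions; the structure is the easy part.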
One fundamental issue that hasn't been covered in detail yet: WE NEED MORE DATA.

How can we achieve this?

First, digger3's graphs are very helpful. But he can only very rarely show statistically significant results between one build and another. We need a lot more people gathering data for this. We can either just ask people, or we can build a plugin that makes it easy (there is some manual setup at the moment).

Second, digger3's graphs don't address throughput. This is THE reason why NLM was turned off: throughput was dramatically down. digger3's graphs are based on the LongTermManySingleBlocksTest, which is based on inserting a bunch of single blocks. It's true that each is timed, but that is irrelevant: what we need is a big insert, where other factors affecting throughput can come into play.

Third, there are those who say our problems are not related to load management at all; they are related to the fact that Freenet nodes with a lot of downloads tend to have a flat distribution of links. This is a natural enough optimisation for their local usage, but if it happens across the network, or even if it happens in clusters, it will wreak havoc. We need some way to estimate how widespread this is. If it is common, we should do something about it, for instance not path folding until requests reach HTL 16 (which is a good idea for security, but if this problem is rare and not clustered it would reduce performance).

The obvious option is to implement the random routed probes that Evan suggested. These should eventually be able to replace the current probe requests, and would return a lot less data and so be safer. They would be routed randomly and then terminate on a node, and return a single value from that node - in this case some metric of how spread out vs specialised its peers are.
Other versions would be able to estimate the network size by a tag-and-release method that should be cheaper than the current probe request based stuff (it would return an estimate of uptime and a unique ID for the node which is kept private and used solely for that purpose), or return the proportion of peers backed off.

Another variant is to random route and then do a request - this would allow us to determine whether content is still retrievable and thus whether we want to reinsert it, or might provide more representative statistics than bootstrapping a new node every time (at least allowing us to separate problems with bootstrapping from problems faced by the average node).

IMHO some or all of the above are worth seriously considering before deploying any new scheme. If we want to be empirical we need to measure the effect of our changes on the real network, not only in simulation.

I like this proposal :)
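The tag-and-release idea maps directly onto the classic Lincoln-Petersen mark-recapture estimator; a sketch (the probe plumbing is elided, and the function name is mine):

```python
def lincoln_petersen(first_round_ids, second_round_ids):
    """Estimate network size from two rounds of random-routed probes.

    Each probe returns the target node's private, probe-only ID (never
    reused for anything else).  If the first round tags n1 distinct
    nodes and the second round reaches n2 distinct nodes, m of them
    recaptures, the Lincoln-Petersen estimate of the population is
    n1 * n2 / m.
    """
    tagged = set(first_round_ids)
    second = set(second_round_ids)
    n1, n2, m = len(tagged), len(second), len(tagged & second)
    if m == 0:
        return None  # no recaptures: the samples only lower-bound the size
    return n1 * n2 / m
```

For instance, tagging 100 nodes and then recapturing 10 of them in a second round of 50 probes estimates a network of about 500 nodes. Partial uptime biases this upward (offline nodes can't be recaptured), which is presumably why the probe also returns an uptime estimate.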
Re: [freenet-dev] How to gather more data was Re: Beyond New Load Management: A proposal
On Thu, Sep 1, 2011 at 7:24 AM, Matthew Toseland t...@amphibian.dyndns.org wrote: I like this proposal :)

Is the documentation on the math of how to get the random routing to behave well sufficient? Let me know if it isn't. The MHMC routing math shouldn't be too complicated, but we want to be certain it's implemented correctly so that the data is sound.

Do you have a metric for how clustered vs uniform a node's peers are?

Maybe. It's a tougher problem than it looks at first glance, unless we have a somewhat reliable network size estimate available. I'll give it some more thought. If you want a qualitative, visual estimate, just change the peer location distribution graph on the stats page to have a logarithmic x axis. Basically, a node should have similar numbers of peers at distance 0.25 <= d < 0.5 as at 0.125 <= d < 0.25, etc. That is, bar n should graph the number of nodes at distance 2^(-n-2) <= d < 2^(-n-1). That doesn't provide an answer you can use in code to evaluate and make decisions, but it should give better anecdotal evidence about problems.

What's MHMC?

Metropolis-Hastings Monte Carlo. We currently use it to get the right probability distribution for location swaps. We should also use it for randomly routed requests. (For ones that need to be statistically random, anyway. It's probably not worth the performance cost for e.g. high-HTL random routing of regular requests for security purposes.)

It's mentioned in the bug report: https://bugs.freenetproject.org/view.php?id=3568 And there's a code snippet on my flog, that's used in my simulator: freenet:USK@gjw6StjZOZ4OAG-pqOxIp5Nk11udQZOrozD4jld42Ac,BYyqgAtc9p0JGbJ~18XU6mtO9ChnBZdf~ttCn48FV7s,AQACAAE/flog/29/200909.xhtml (Entry for 20090926)

Evan Daniel
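The logarithmic histogram described above is easy to sketch (names mine; Freenet's stats page code will differ): bar n counts peers whose circular keyspace distance falls in [2^(-n-2), 2^(-n-1)), so a healthy small-world link distribution shows roughly equal bars.

```python
import math

def circle_distance(a, b):
    """Distance between two locations on the [0, 1) keyspace circle."""
    d = abs(a - b)
    return min(d, 1.0 - d)

def log_peer_histogram(my_location, peer_locations, bars=8):
    """Bar n counts peers at distance 2^-(n+2) <= d < 2^-(n+1).

    On an ideal Kleinberg-style topology the bars are roughly equal;
    a spike in the high-n bars means the peers are clustered around
    the node's own location.
    """
    counts = [0] * bars
    for loc in peer_locations:
        d = circle_distance(my_location, loc)
        if d <= 0.0:
            continue  # identical locations fall off the log scale
        n = int(-math.log2(d)) - 1  # d in [2^-(n+2), 2^-(n+1)) -> bar n
        counts[min(max(n, 0), bars - 1)] += 1  # lump the tail into the last bar
    return counts
```

As noted, this is only a visual diagnostic: it shows whether a node's link lengths are roughly flat on a log scale, not a single number code could act on.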
Re: [freenet-dev] How to gather more data was Re: Beyond New Load Management: A proposal
On Wed, Aug 31, 2011 at 9:00 AM, Matthew Toseland t...@amphibian.dyndns.org wrote: On Monday 29 Aug 2011 19:28:50 Ian Clarke wrote:

Ok, thinking about it, here is a proposal, or rather, the beginning of a proposal. I'm assuming that we get rid of NLM, fair sharing, and anything else intended to control load, and replace it with this. We will absolutely need to simulate this before we write a single line of code to deploy it.

The core idea is that a node will include a floating point number in response to any kind of request showing how close that node is to being overloaded. 0.0 would mean it's doing nothing, 1.0 would mean that it is completely saturated and must reject requests. Clearly the goal is to avoid getting anywhere near to 1.0.

A node tracks several things:
- What is the overall average load reported by responses this node has received
- What is the average load reported by responses this node has received, per remote node
- What is the average load reported by responses this node forwarded, per remote node

I think, given these metrics, we should be able to do the following:
- Limit our overall rate of initiating local requests based on the global average reported load
- Limit our rate of local requests based on the average load of the connection to the peer it would need to be forwarded to
- Detect when remote peers are abusing the system by disregarding load - as evidenced by a significantly higher average load in replies forwarded to them

Of course, lots of questions:
- How do we compute the averages? A decaying running average of some kind? What time period?
- How do we translate load into a desired rate of requests?
- What criteria indicate that a peer is abusing the system? What is the remedy?

This is basically control theory stuff, and we'll need robust answers to these questions before we proceed (ideally with a theoretical rather than experimental foundation).

One fundamental issue that hasn't been covered in detail yet: WE NEED MORE DATA.

How can we achieve this?

First, digger3's graphs are very helpful. But he can only very rarely show statistically significant results between one build and another. We need a lot more people gathering data for this. We can either just ask people, or we can build a plugin that makes it easy (there is some manual setup at the moment).

Second, digger3's graphs don't address throughput. This is THE reason why NLM was turned off: throughput was dramatically down. digger3's graphs are based on the LongTermManySingleBlocksTest, which is based on inserting a bunch of single blocks. It's true that each is timed, but that is irrelevant: what we need is a big insert, where other factors affecting throughput can come into play.

Third, there are those who say our problems are not related to load management at all; they are related to the fact that Freenet nodes with a lot of downloads tend to have a flat distribution of links. This is a natural enough optimisation for their local usage, but if it happens across the network, or even if it happens in clusters, it will wreak havoc. We need some way to estimate how widespread this is. If it is common, we should do something about it, for instance not path folding until requests reach HTL 16 (which is a good idea for security, but if this problem is rare and not clustered it would reduce performance).

The obvious option is to implement the random routed probes that Evan suggested. These should eventually be able to replace the current probe requests, and would return a lot less data and so be safer. They would be routed randomly and then terminate on a node, and return a single value from that node - in this case some metric of how spread out vs specialised its peers are.

Other versions would be able to estimate the network size by a tag-and-release method that should be cheaper than the current probe request based stuff (it would return an estimate of uptime and a unique ID for the node which is kept private and used solely for that purpose), or return the proportion of peers backed off.

Another variant is to random route and then do a request - this would allow us to determine whether content is still retrievable and thus whether we want to reinsert it, or might provide more representative statistics than bootstrapping a new node every time (at least allowing us to separate problems with bootstrapping from problems faced by the average node).

IMHO some or all of the above are worth seriously considering before deploying any new scheme. If we want to be empirical we need to measure the effect of our changes on the real network, not only in simulation.

I like this proposal :) Is the