Re: Salut/avahi/meshview issues
On Thu, Jan 31, 2008 at 07:32:36AM -0500, Giannis Galanis wrote: I believe our current salut/avahi issues are described in the following points: 1. I was under the impression that when a peer switches channels it sends a goodbye signal, and that only abnormally removed peers (after crashes, power-offs by pressing the button, etc.) would be slow to disappear from mesh views. The 10-min TTL is not unreasonable, but it should only be used as a routine check. Peers that leave or arrive should inform the mesh instantly. In that case the 10-min TTL would only affect mesh points with noisy links whose goodbye signals get lost, and those connections are lower priority anyway. We could also send 2-3 goodbye signals to ensure delivery. I don't think avahi gets a chance to send goodbye packets. More specifically, I don't think NM or any other mechanism actually tells avahi "we're about to leave the network, please say goodbye", and then gives it a chance to actually send the necessary goodbyes. 2. We should definitely decrease the timeout window between a lost peer being detected and its actual disappearance from the mesh view. This used to be 10 min, now it is 20 min, but in my experience, if a peer has been away for more than 1-2 min, it isn't coming back. In the code it's actually 12 min plus the time it takes avahi to conclude that a node has gone. So this used to be around 14 minutes at most, but with the TTL raised to 10 min it will be around 22 minutes. It would be interesting to see whether, with the latest patches, the number of false negatives has gone down enough that we can remove, or at least decrease, the slack time we add after avahi reports a node as gone. 3. Should we make the above TTL and timeout user-specific, or at least customizable? Will there be a problem if two XOs have different TTLs? I would assume not.
The idea is that it is a waste of our resources to try to find the ideal TTL and timeout values by asking the Collabora team to fix them, and fix them again, when we can run the tests here in 1cc and find out for ourselves which values suit us best. Is it easy to implement such a patch? 4. The #5501 bug (xmas tree effect). This is a very specific bug in the protocol, and I believe it will be sorted out soon. This one is fixed, right? 5. Why are avahi, salut and the mesh view not communicating well? I hope we will have some answers on that as well. I'm not sure. If salut and the mesh view fail to communicate, the same problem should show up with gabble. Sjoerd -- Consider a spherical bear, in simple harmonic motion... -- Professor in the UCB physics department ___ Devel mailing list Devel@lists.laptop.org http://lists.laptop.org/listinfo/devel
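The arithmetic in Sjoerd's reply can be sketched as follows. This is a minimal sketch, assuming avahi refreshes cached records the way the mDNS specification (RFC 6762, section 5.2) describes: re-query at 80%, 85%, 90% and 95% of the record's TTL and drop the record at 100% if nobody answers. The thread does not confirm avahi's exact schedule, and the RFC's small random jitter is left out. Under that assumption a silently departed peer lingers in the cache for at most one TTL, and the 12-minute slack is added on top. The function names are illustrative only.

```python
# Sketch: mDNS cache refresh points (RFC 6762, section 5.2) plus the fixed
# removal slack mentioned in the thread. All values are in seconds.

REFRESH_FRACTIONS = (0.80, 0.85, 0.90, 0.95)  # re-query points as a fraction of TTL


def refresh_queries(ttl_s):
    """Times (seconds after the last answer) at which a querier re-asks for a record."""
    return [round(f * ttl_s) for f in REFRESH_FRACTIONS]


def worst_case_removal(ttl_s, slack_s=12 * 60):
    """Upper bound on how long a silently departed peer stays in the mesh view:
    the cached record survives at most one full TTL, then the slack applies."""
    return ttl_s + slack_s


for ttl in (2 * 60, 10 * 60):
    print(ttl, refresh_queries(ttl), worst_case_removal(ttl) / 60)
```

With a 2-minute TTL this gives 14 minutes and with a 10-minute TTL 22 minutes, the same figures as in the reply.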
Re: Salut/avahi/meshview issues
I don't think avahi gets a chance to send goodbye packets. More specifically, I don't think NM or any other mechanism actually tells avahi "we're about to leave the network, please say goodbye", and then gives it a chance to actually send the necessary goodbyes. Yes. The warning (we're changing the channel) could be there, but the opportunity (to send the goodbyes) probably would not, anyway. So you are probably right. Implementing this would require a lot of effort for a small gain.
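For reference, an mDNS "goodbye" is simply an ordinary response whose record TTL is zero (RFC 6762, section 10.1); receivers then drop the record about a second later instead of waiting for it to expire. The sketch below builds such a packet for one PTR record and can send it a few times, as Giannis suggests, to improve the odds on lossy links. It is not avahi's code: the service type `_presence._tcp.local` is the link-local XMPP type from XEP-0174, the instance name `alice@xo-1` is made up, name compression and label-length checks are omitted, and the hard part discussed here (getting a chance to run this before the radio switches channels) is not addressed.

```python
import socket
import struct

MDNS_GROUP = "224.0.0.251"
MDNS_PORT = 5353
TYPE_PTR = 12
CLASS_IN = 1


def encode_name(name):
    """DNS wire format: length-prefixed labels ending in a zero byte (no compression)."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("utf-8")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"


def goodbye_packet(service_type, instance):
    """mDNS response carrying one PTR record with TTL 0, i.e. a 'goodbye'."""
    # Header: id 0, flags 0x8400 (response + authoritative), 0 questions, 1 answer.
    header = struct.pack("!HHHHHH", 0, 0x8400, 0, 1, 0, 0)
    rdata = encode_name(instance + "." + service_type)
    # Answer record: owner name, type, class, TTL = 0, rdata length, rdata.
    answer = encode_name(service_type) + struct.pack(
        "!HHIH", TYPE_PTR, CLASS_IN, 0, len(rdata)) + rdata
    return header + answer


def send_goodbyes(packet, repeats=3):
    """Multicast the same goodbye several times to survive packet loss."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # RFC 6762 asks for an IP TTL of 255 on mDNS packets.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 255)
    try:
        for _ in range(repeats):
            sock.sendto(packet, (MDNS_GROUP, MDNS_PORT))
    finally:
        sock.close()


pkt = goodbye_packet("_presence._tcp.local", "alice@xo-1")
```

On a machine with a multicast route, `send_goodbyes(pkt)` would put three identical goodbyes on the wire.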
Re: Salut/avahi/meshview issues
I believe our current salut/avahi issues are described in the following points: 1. I was under the impression that when a peer switches channels it sends a goodbye signal, and that only abnormally removed peers (after crashes, power-offs by pressing the button, etc.) would be slow to disappear from mesh views. The 10-min TTL is not unreasonable, but it should only be used as a routine check. Peers that leave or arrive should inform the mesh instantly. In that case the 10-min TTL would only affect mesh points with noisy links whose goodbye signals get lost, and those connections are lower priority anyway. We could also send 2-3 goodbye signals to ensure delivery. Mm, it seems that some dbus signal, or the corresponding processing by the PS, is missing. Is there an NM dbus signal when we change channels? This should be easy to determine. 2. We should definitely decrease the timeout window between a lost peer being detected and its actual disappearance from the mesh view. This used to be 10 min, now it is 20 min, but in my experience, if a peer has been away for more than 1-2 min, it isn't coming back. From what you describe, this does not seem related to the protocol itself, right? I believe it is important to achieve our goals without making the protocol more chatty. 3. Should we make the above TTL and timeout user-specific, or at least customizable? Will there be a problem if two XOs have different TTLs? I would assume not. The idea is that it is a waste of our resources to try to find the ideal TTL and timeout values by asking the Collabora team to fix them, and fix them again, when we can run the tests here in 1cc and find out for ourselves which values suit us best. Is it easy to implement such a patch? I believe it is useful to have some controls to help tune things. But not all of them need to be turned into user-friendly controls. I believe your question is how we could change these settings ourselves. Did I get that right?
Re: Salut/avahi/meshview issues
2. It takes up to 10 min for avahi even to detect that a peer is inactive, i.e. if an XO switches channels, avahi won't know for up to 10 min (it used to be 1-2 min). Is this with or without the patch from bug #6162? If without, then the time it takes avahi to discover it should still be 2 minutes. I'd like to know how you test this. Oh, and please file a bug, so we can actually track these issues. The #6162 patch, as well as the #5501 patch, are included in 689/690, which I am testing. So this indeed explains the 10 minutes (actually, I only just found out about this bug). 3. It takes a total of about 30 min for the XO to vanish from the mesh view (this is far too long!) Again, file a bug. The information needed here is whether there is a time difference between when avahi marks something as removed, when salut sends out the removed signal, and when it actually disappears from the mesh view. This is now filed as #6282, with all the dbus-monitor/avahi-browse logs for comparison. This case is also an example of an avahi/mesh view inconsistency: icons disappear from the mesh view but remain in the avahi cache for about 1 h longer. But these details should continue in trac anyway. 4. Avahi and the mesh view respond independently. It used to be that when an entry disappeared in avahi, it disappeared from the mesh view, and likewise when new peers arrived. This relationship was very consistent. Now, however, we have the following cases: a) an XO vanishes from the mesh view but remains indefinitely in the avahi cache as "failed to resolve"; b) sometimes avahi shows far fewer peers than the mesh view, and the extra peers in the mesh view are definitely active, since they respond properly to activity joining/sharing; c) sometimes avahi includes more active peers than the mesh view. Does anyone know why this is happening? Is it a bug? I have logs, if needed, comparing avahi-browse output with timestamped dbus-monitor logs, which show the inconsistencies.
Well, you all list them as undesired behaviour, so I would say they're bugs. 5. An important improvement is that peers generally no longer fail on their own. So if many XOs join a mesh channel and no one leaves, they will not start failing. This used to be a common effect after 4-5 XOs. In fact, I once saw 61 active XOs in the mesh view in 1cc! When you say salut, you actually mean avahi. It would help if you could be clear about what you mean :) This improvement is probably due to the fix in #5501. I do mean avahi. In the past these two were very tightly coupled. And I believe the only direct way to examine salut is by checking the buddy list in the Analyze activity. I remember Ricardo had an interesting case where the buddy list included plenty of XOs, which were also sharing properly in the mesh view, but the avahi list was empty. Does this seem possible? (Unfortunately I have no log at the moment.) Anyway, for all the bugs you should have filed instead of sending this mail, I will need tcpdump logs, avahi logs, salut logs and, if possible, mesh view logs showing when contacts are removed from the mesh, from a machine where you saw the behaviour. Preferably with timestamps. I have updated trac with logs, tcpdumps, dbus-monitor output and screenshots... enjoy! The reason I sent this email first, before filing tons of bugs, is that I thought it was necessary to describe the big picture and the current status of salut, and also to avoid duplicate bugs, or bugs that are in fact intentional changes. This conversation was unfortunately steered towards other issues (wireless difficulties are a sensitive subject at OLPC!), but its purpose was to identify some very specific bugs in salut that, at this point, have nothing to do with the scalability or robustness of the protocol. Once these are resolved we can move on to scalability, about which I am very confident.
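To report cases a) to c) consistently, it may help to compare snapshots of both views taken at the same moment. This is a minimal sketch, assuming the peer names have already been extracted from avahi-browse output and from the mesh view (or dbus-monitor) logs; the function name, the output keys and the peer names are illustrative, and the parsing is left out because the log formats are not shown in the thread.

```python
def compare_views(avahi_peers, mesh_peers):
    """Classify disagreements between simultaneous avahi and mesh view snapshots."""
    avahi, mesh = set(avahi_peers), set(mesh_peers)
    return {
        # Cases a) and c): avahi still lists peers that the mesh view does not show.
        "only_in_avahi": sorted(avahi - mesh),
        # Case b): the mesh view shows peers that avahi no longer lists.
        "only_in_mesh_view": sorted(mesh - avahi),
        "in_both": sorted(avahi & mesh),
    }


report = compare_views(["xo-a", "xo-b", "xo-c"], ["xo-b", "xo-c", "xo-d"])
print(report)
```

Run over a series of timestamps, the three lists show when the two views start to diverge.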
Re: Salut/avahi/meshview issues
On Jan 31, 2008 10:54 AM, Ricardo Carrano [EMAIL PROTECTED] wrote: I believe our current salut/avahi issues are described in the following points: 1. I was under the impression that when a peer switches channels it sends a goodbye signal, and that only abnormally removed peers (after crashes, power-offs by pressing the button, etc.) would be slow to disappear from mesh views. The 10-min TTL is not unreasonable, but it should only be used as a routine check. Peers that leave or arrive should inform the mesh instantly. In that case the 10-min TTL would only affect mesh points with noisy links whose goodbye signals get lost, and those connections are lower priority anyway. We could also send 2-3 goodbye signals to ensure delivery. Mm, it seems that some dbus signal, or the corresponding processing by the PS, is missing. Is there an NM dbus signal when we change channels? This should be easy to determine. It must be very easy for the PS to detect a channel change, or in any case when the XO leaves the channel. The point is whether avahi supports such notifications, so that the other peers can remove the entry instantly. 2. We should definitely decrease the timeout window between a lost peer being detected and its actual disappearance from the mesh view. This used to be 10 min, now it is 20 min, but in my experience, if a peer has been away for more than 1-2 min, it isn't coming back. From what you describe, this does not seem related to the protocol itself, right? I believe it is important to achieve our goals without making the protocol more chatty. This timeout is client-specific and doesn't affect the protocol itself at all. The reason this timeout exists (to my knowledge, anyway) is that sometimes a peer seems undiscoverable, but in fact this is just the effect of a poor link, and the peer rejoins shortly after. The effect would be XOs moving around the mesh view. To solve this, we wait several minutes before actually removing the XO.
In my opinion, the more we hide from the user, the more confused she gets. Keeping the icon in the mesh view while the connection is down just messes things up. I also remember the idea of keeping the lost icon in the mesh view but notifying the user somehow, e.g. by changing its outline to a dotted line. But this is a UI issue. 3. Should we make the above TTL and timeout user-specific, or at least customizable? Will there be a problem if two XOs have different TTLs? I would assume not. The idea is that it is a waste of our resources to try to find the ideal TTL and timeout values by asking the Collabora team to fix them, and fix them again, when we can run the tests here in 1cc and find out for ourselves which values suit us best. Is it easy to implement such a patch? I believe it is useful to have some controls to help tune things. But not all of them need to be turned into user-friendly controls. I believe your question is how we could change these settings ourselves. Did I get that right? Exactly. We by no means need these controls to be user-friendly. We only need the ability to tune them dynamically ourselves, for testing and evaluation purposes.
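The delayed removal Giannis describes (wait before dropping an icon, so that a peer on a poor link that rejoins shortly after keeps its place) can be sketched as below, with the delay made tunable for testing in 1cc as he asks. This is a minimal sketch, not the Presence Service's code: the `RemovalGrace` class, its methods and the `MESH_REMOVAL_GRACE_S` environment variable are hypothetical, and the 12-minute default only mirrors the slack Sjoerd mentions.

```python
import os

# Hypothetical tuning knob; nothing in the thread names such a variable.
DEFAULT_GRACE_S = 12 * 60
GRACE_S = float(os.environ.get("MESH_REMOVAL_GRACE_S", DEFAULT_GRACE_S))


class RemovalGrace:
    """Delay removing a buddy icon; cancel the removal if the peer comes back."""

    def __init__(self, grace_s=GRACE_S):
        self.grace_s = grace_s
        self.pending = {}  # peer -> time its record was first reported gone

    def peer_removed(self, peer, now):
        # Keep the earliest report if avahi signals the removal more than once.
        self.pending.setdefault(peer, now)

    def peer_added(self, peer):
        # The peer reappeared within the window: its icon stays where it was.
        self.pending.pop(peer, None)

    def expired(self, now):
        """Return the peers whose grace period has run out; they leave the mesh view."""
        gone = [p for p, t in self.pending.items() if now - t >= self.grace_s]
        for p in gone:
            del self.pending[p]
        return gone
```

Setting the hypothetical variable to, say, 90 before starting a session would let testers try the 1-2 minute window Giannis proposes without a code change.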
Re: Salut/avahi/meshview issues
On Wed, Jan 30, 2008 at 01:21:29AM -0500, Giannis Galanis wrote: The results were: 1. The xmas tree effect is still here, i.e. XOs occasionally vanish and reappear in different positions. This is because of the following: when the avahi cache includes several inactive/departed peers (reported as failed) and a new peer arrives, all the inactive peers instantly vanish from the screen (#5501). If their inactivity was temporary, they reappear shortly afterwards in a different location. If, e.g., 3-4 XOs are moved simultaneously (by user intervention) from ch6 to ch11 and then back to ch6, the icons won't have time to disappear. BUT the first one to return to ch6 will trigger the effect/bug for the others, which will instantly vanish. Shortly after, they will all naturally return to ch6 one by one and reappear in different locations. There was a patch for this issue (#5501), which was included in 678+, but it has no effect. Please report your findings in the actual bug report so we can track it. Also, don't expect things to be fixed in a reasonable timeframe if we have to wait more than a month before our changes are actually tested. 2. It takes up to 10 min for avahi even to detect that a peer is inactive, i.e. if an XO switches channels, avahi won't know for up to 10 min (it used to be 1-2 min). Is this with or without the patch from bug #6162? If without, then the time it takes avahi to discover it should still be 2 minutes. I'd like to know how you test this. Oh, and please file a bug, so we can actually track these issues. 3. It takes a total of about 30 min for the XO to vanish from the mesh view (this is far too long!) Again, file a bug. The information needed here is whether there is a time difference between when avahi marks something as removed, when salut sends out the removed signal, and when it actually disappears from the mesh view. 4. Avahi and the mesh view respond independently.
It used to be that when an entry disappeared in avahi, it disappeared from the mesh view, and likewise when new peers arrived. This relationship was very consistent. Now, however, we have the following cases: a) an XO vanishes from the mesh view but remains indefinitely in the avahi cache as "failed to resolve"; b) sometimes avahi shows far fewer peers than the mesh view, and the extra peers in the mesh view are definitely active, since they respond properly to activity joining/sharing; c) sometimes avahi includes more active peers than the mesh view. Does anyone know why this is happening? Is it a bug? I have logs, if needed, comparing avahi-browse output with timestamped dbus-monitor logs, which show the inconsistencies. Well, you all list them as undesired behaviour, so I would say they're bugs. 5. An important improvement is that peers generally no longer fail on their own. So if many XOs join a mesh channel and no one leaves, they will not start failing. This used to be a common effect after 4-5 XOs. In fact, I once saw 61 active XOs in the mesh view in 1cc! When you say salut, you actually mean avahi. It would help if you could be clear about what you mean :) This improvement is probably due to the fix in #5501. This shows that salut is more capable than we expected. Well, both avahi and salut are quite capable. I'm not sure why it has such a bad reputation with you. Probably because you're only seeing it in a very, very extreme network, and because there seems to be a lot of FUD about mDNS going around. mDNS definitely isn't an optimal protocol for a mesh, but most of the issues in big rollouts are actually caused by the wireless firmware not being good enough to do actual multicast routing. Anyway, for all the bugs you should have filed instead of sending this mail, I will need tcpdump logs, avahi logs, salut logs and, if possible, mesh view logs showing when contacts are removed from the mesh, from a machine where you saw the behaviour.
Preferably with timestamps. Sjoerd -- I am what you will be; I was what you are.
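Sjoerd asks for three timestamps per removed peer: when avahi marks it as removed, when salut emits the removed signal, and when it leaves the mesh view. Once those are extracted from the logs, the per-stage delays can be computed as in this sketch. The stage names and the ISO 8601 timestamp format are assumptions, not something the thread specifies, and `datetime.fromisoformat` needs Python 3.7 or later.

```python
from datetime import datetime

STAGES = ("avahi_removed", "salut_signal", "mesh_view_removed")


def stage_delays(events):
    """events maps each stage name to an ISO 8601 timestamp for one peer.
    Returns the delay in seconds between consecutive stages and in total."""
    t = {stage: datetime.fromisoformat(events[stage]) for stage in STAGES}
    return {
        "avahi_to_salut": (t["salut_signal"] - t["avahi_removed"]).total_seconds(),
        "salut_to_mesh_view": (t["mesh_view_removed"] - t["salut_signal"]).total_seconds(),
        "total": (t["mesh_view_removed"] - t["avahi_removed"]).total_seconds(),
    }


# Made-up timestamps, only to show the call.
delays = stage_delays({
    "avahi_removed": "2008-01-30T10:00:00",
    "salut_signal": "2008-01-30T10:00:02",
    "mesh_view_removed": "2008-01-30T10:12:02",
})
print(delays)
```

A large `salut_to_mesh_view` value would point at the mesh view side rather than at avahi or salut.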
Re: Salut/avahi/meshview issues
Sjoerd, could you please elaborate on this? What do you mean by "wireless firmware not being good enough to do actual multicast routing"? Thanks, Ricardo Carrano
Re: Salut/avahi/meshview issues
Sjoerd Simons [EMAIL PROTECTED] wrote on 01/30/2008 06:45:48 AM: Well, both avahi and salut are quite capable. I'm not sure why it has such a bad reputation with you. Probably because you're only seeing it in a very, very extreme network, and because there seems to be a lot of FUD about mDNS going around. mDNS definitely isn't an optimal protocol for a mesh, but most of the issues in big rollouts are actually caused by the wireless firmware not being good enough to do actual multicast routing. Have you given any thought to the overhead (just in traffic; let's leave memory out of the question for now) required to do what you call "actual multicast routing" in the firmware? We all understand how difficult what we are trying to achieve is. The firmware hasn't changed much since you started working on this project. So let's drop the finger-pointing and try to come up with realistic, implementable solutions. Yianni does testing; he doesn't care where specifically the problem is, all he wants is to see something that works. M.
Re: Salut/avahi/meshview issues
On Wed, Jan 30, 2008 at 09:27:06AM -0500, Michail Bletsas wrote: Sjoerd Simons [EMAIL PROTECTED] wrote on 01/30/2008 06:45:48 AM: Well, both avahi and salut are quite capable. I'm not sure why it has such a bad reputation with you. Probably because you're only seeing it in a very, very extreme network, and because there seems to be a lot of FUD about mDNS going around. mDNS definitely isn't an optimal protocol for a mesh, but most of the issues in big rollouts are actually caused by the wireless firmware not being good enough to do actual multicast routing. Have you given any thought to the overhead (just in traffic; let's leave memory out of the question for now) required to do what you call "actual multicast routing" in the firmware? I did some research into mesh routing protocols before starting the salut MUC work. From the research papers I've seen, proper multicast routing seems entirely viable. Traffic and memory overhead depend on the exact trade-offs you make and the protocols used. So I see no reason why this can't be done on OLPC's mesh network. We all understand how difficult what we are trying to achieve is. The firmware hasn't changed much since you started working on this project. So let's drop the finger-pointing and try to come up with realistic, implementable solutions. As I said, from my point of view, proper multicast routing is an entirely realistic thing. Note that nobody is claiming mDNS is particularly suited to mesh networks, because it's not. The reason we used it is that it was already used on the OLPC mesh even before salut came along, and we just didn't have the resources to do both a new presence protocol and a MUC protocol. Also note that our MUC protocol uses multicast; the rationale for that was outlined when we originally proposed Telepathy. The exact rationale doesn't matter much now. The point is that we've always been quite clear about the fact that we're heavily using multicast.
And nobody ever claimed that this was a bad or unrealistic thing (at some point there were even interns at OLPC experimenting with reliable multicast on the mesh, so it seems that even inside OLPC multicast was regarded as a good thing). So we always (maybe naively) assumed the mesh did, or could do, proper multicasting. When we discovered the mesh did not do proper multicasting, we did tell various people that this was going to be a bad thing. But apparently nobody ever seemed to think this was a big deal until recently. So if you now want to say that multicast is not a viable solution and never will be, because for some reason it's unrealistic with the current mesh hardware, please make this very, very clear. Then at least we can throw more than a year of development effort out of the window and redo things from scratch. Yianni does testing; he doesn't care where specifically the problem is, all he wants is to see something that works. Well, for good testing he should at least have an idea of where the problems are and what issues are involved :) The scalability problem lies in the current combination of the mesh implementation and the mDNS traffic; how exactly we're going to solve that is still up for discussion. Sjoerd -- Each of us bears his own Hell. -- Publius Vergilius Maro (Virgil)
Re: Salut/avahi/meshview issues
On Jan 30, 2008, at 5:12 PM, Michail Bletsas wrote: For completely serverless environments, what we have is invaluable. The fact that it doesn't scale to large numbers of nodes doesn't make it useless. I'm similarly confused about people's insistence on a rigid dichotomy between the approaches. I never regarded our mesh work to be aimed at replacing proper infrastructure -- its goal was to provide a viable (if degraded) transport when proper infrastructure was prohibitively expensive or otherwise not an option. We always knew that this approach carried scaling limits, and that's _fine_. As Michail says, this by no means makes the system useless. We have serious problems making Avahi and even the Jabber server do their thing with small numbers of nodes These two are very different. Avahi is hitting design and network limits. With Jabber, the problem is our ugly shared roster hack which makes the system do something it's not designed to; this is not an issue intrinsic to Jabber. -- Ivan Krstić [EMAIL PROTECTED] | http://radian.org
Re: Salut/avahi/meshview issues
Sjoerd Simons [EMAIL PROTECTED] wrote on 01/30/2008 10:46:33 AM: I did some research into mesh routing protocols before starting the salut MUC work. From the research papers I've seen, proper multicast routing seems entirely viable. Traffic and memory overhead depend on the exact trade-offs you make and the protocols used. So I see no reason why this can't be done on OLPC's mesh network. Given ample time, resources, many good programmers and a Turing machine, everything is possible. We have something more than a Turing machine, but we are seriously short on the other fronts. The distance from research papers to an actual implementation is a great one. We all understand how difficult what we are trying to achieve is. The firmware hasn't changed much since you started working on this project. So let's drop the finger-pointing and try to come up with realistic, implementable solutions. As I said, from my point of view, proper multicast routing is an entirely realistic thing. No, it is not, given the constraints at hand. Note that nobody is claiming mDNS is particularly suited to mesh networks, because it's not. The reason we used it is that it was already used on the OLPC mesh even before salut came along, and we just didn't have the resources to do both a new presence protocol and a MUC protocol. Also note that our MUC protocol uses multicast; the rationale for that was outlined when we originally proposed Telepathy. The exact rationale doesn't matter much now. The point is that we've always been quite clear about the fact that we're heavily using multicast. And nobody ever claimed that this was a bad or unrealistic thing (at some point there were even interns at OLPC experimenting with reliable multicast on the mesh, so it seems that even inside OLPC multicast was regarded as a good thing). So we always (maybe naively) assumed the mesh did, or could do, proper multicasting.
When we discovered the mesh did not do proper multicasting, we did tell various people that this was going to be a bad thing. But apparently nobody ever seemed to think this was a big deal until recently. We found that out well before you did, hence the need to be able to transition from a p2p mDNS approach to the unicast, server-based one. (What we still lack is the intermediate step, i.e. having XOs become presence servers, a.k.a. supernodes, on demand.) The fact that some people were shocked when they realized that you cannot cram 500 XOs under one roof and still expect to pass traffic around when you rely heavily on basic-rate multicast over the mesh is not a reason to radically rethink everything from scratch. We discussed very early on how important it would be to be able to control the flood, hence the requirement for per-application mesh TTL settings (so that we can even disable multicast flooding by setting the TTL to 1 for scenarios like the one in Mongolia). We can always decrease the contention window if we increase the multicast rate. For completely serverless environments, what we have is invaluable. The fact that it doesn't scale to large numbers of nodes doesn't make it useless. Yianni does testing; he doesn't care where specifically the problem is, all he wants is to see something that works. Well, for good testing he should at least have an idea of where the problems are and what issues are involved :) The scalability problem lies in the current combination of the mesh implementation and the mDNS traffic; how exactly we're going to solve that is still up for discussion. I don't think the issues Yianni pointed out are directly related to the transport's multicast scalability issues. We have serious problems making Avahi and even the Jabber server do their thing with small numbers of nodes, so let's not blame the transport for everything. M.
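The effect of the mesh TTL Michail mentions can be illustrated with a toy model of hop-limited blind flooding, in which every node that hears a new copy with hops left rebroadcasts it exactly once. This is a sketch on an abstract graph, assuming no collisions or losses; it says nothing about how the firmware actually implements its per-application TTL, and the `flood` function is illustrative. With TTL 1 only the sender transmits, while with a large TTL every reachable node transmits, so one round of announcements from n mutually reachable nodes costs about n² transmissions.

```python
from collections import deque


def flood(adjacency, source, ttl):
    """Blind flooding with a hop limit.
    adjacency maps each node to its radio neighbours.
    Returns (number of transmissions, set of nodes that received the packet)."""
    received = {source}
    transmissions = 0
    queue = deque([(source, ttl)])  # BFS: each node is first reached by a shortest path
    while queue:
        node, hops_left = queue.popleft()
        if hops_left == 0:
            continue  # hop limit reached: this node does not rebroadcast
        transmissions += 1  # one broadcast reaches all neighbours at once
        for nbr in adjacency[node]:
            if nbr not in received:
                received.add(nbr)
                queue.append((nbr, hops_left - 1))
    return transmissions, received


line = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(flood(line, 0, 1))  # (1, {0, 1}): only the sender transmits
print(flood(line, 0, 10))  # every node on the line rebroadcasts once
```

In a fully connected group of 10 nodes with TTL of 2 or more, each announcement costs 10 transmissions, so one announcement per node costs 100.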
Re: Salut/avahi/meshview issues
Allow me to offer a perspective. Last year I went to a trial school in Porto Alegre (in the south of Brazil). We took five XOs and went to the housing project where the children live. There, five kids could use the Chat activity from their homes. Everyone was very excited. The possibilities of the mesh are huge (OK, within limits, like anything). Let's not forget that no infrastructure is the default condition around the world (at least for many years to come). This is not a waste of our time. On Jan 30, 2008 2:25 PM, Ivan Krstić [EMAIL PROTECTED] wrote: [...]
Re: Salut/avahi/meshview issues
On Jan 30, 2008, at 6:34 PM, Ricardo Carrano wrote: This is not a waste of our time. Your reply is addressed to me, so I'm not sure whether you understood me to be implying that the mesh is a waste of our time. I was trying to say exactly the opposite. -- Ivan Krstić [EMAIL PROTECTED] | http://radian.org
Re: Salut/avahi/meshview issues
I got your point. I apologize if my message was unclear on that. 2008/1/30 Ivan Krstić [EMAIL PROTECTED]: [...]
Re: Salut/avahi/meshview issues
On Wed, Jan 30, 2008 at 10:37:24AM -0200, Ricardo Carrano wrote: Sjoerd, could you please elaborate on this? What do you mean by "wireless firmware not being good enough to do actual multicast routing"? "Not good enough" might be a bit harsh. What I mean is that, as far as I know, it doesn't implement any mesh multicast routing protocol. MAODV is a well-known example of a mesh multicast routing protocol, and there are a whole bunch more. Wikipedia has a whole list[0], and if you look into the research on MANETs a bit you'll probably find those and a whole lot more. If it implemented one of those, the point at which the network starts melting down because of mDNS traffic should be a lot higher than it is now, especially in a reasonably dense network, like, say, inside one school. Sjoerd 0: http://en.wikipedia.org/wiki/List_of_ad-hoc_routing_protocols#Multicast_Routing -- Each of us bears his own Hell. -- Publius Vergilius Maro (Virgil)
Re: Salut/avahi/meshview issues
Sjoerd, I know the Wikipedia list. It is mostly papers (never implemented), and to the best of my knowledge none of them has been implemented at layer 2. On Jan 30, 2008 4:12 PM, Sjoerd Simons [EMAIL PROTECTED] wrote: [...]
Re: Salut/avahi/meshview issues
Like Michail and Ricardo said, going from a paper publication to an actual implementation, and also _testing_ that implementation, is a very long way. The following factors need to be taken into account when comparing various approaches to routing and presence in MANETs:
1) scalability: I would consider broadcasting a special case of multicasting, and as such I assume it is an O(n^2) approach (meaning that, on average, there are n packets in the network for each of the n nodes).
2) mobility: Requiring our protocol to handle mobile nodes eliminates a good portion of the literature on routing in ad-hoc networks. AODV is the most widely adopted algorithm for routing in MANETs.
3) simplicity: This is more important than it sounds, because it is the factor that allows theory to turn into implementation. Multicasting in the mesh network does not scale, but it is relatively simple.
My approach provides presence information for 100 nodes with a total overhead of 120Kbps in the worst case (everybody in range of each other, as in the school scenario). For 200 nodes the overhead would be up to 240Kbps in the worst case, and so on. The time resolution is 10 seconds/hop, so over 5 hops it takes 50 seconds for a change to propagate from one side to the other. By doubling the per-hop interval to 20 seconds/hop, the overhead is halved to 60Kbps for 100 nodes, etc. The whole implementation is about 700 lines of Python code, which should give a hint about its simplicity. I have implemented both the protocol and a simulator that runs multiple instances of the actual implementation, to demonstrate its scalability. The problem is that running more than 50 nodes on my 1.8GHz Centrino uses up all available processing power, and packets start getting dropped.
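The figures in this message follow a simple scaling rule: overhead grows linearly with the node count and inversely with the per-hop interval, and propagation delay is hops times interval. The Python sketch below reproduces that arithmetic, assuming the 120Kbps / 100 nodes / 10 s reference point is exact and that the linear scaling holds (the message only gives worst-case figures). The function names are hypothetical; this illustrates the numbers, not the 700-line implementation.

```python
# Reference point from the message (worst case, everybody in range of each other).
REF_NODES = 100
REF_INTERVAL_S = 10.0      # seconds per hop
REF_OVERHEAD_KBPS = 120.0


def presence_overhead_kbps(nodes, interval_s):
    """Assumed model: linear in the node count, inversely proportional to the per-hop interval."""
    return REF_OVERHEAD_KBPS * (nodes / REF_NODES) * (REF_INTERVAL_S / interval_s)


def propagation_delay_s(hops, interval_s):
    """A change advances one hop per interval."""
    return hops * interval_s


def flooding_packets(nodes):
    """Broadcast flooding as in point 1): n announcements, each rebroadcast by all n nodes."""
    return nodes * nodes


if __name__ == "__main__":
    print(presence_overhead_kbps(200, 10))                  # 240.0
    print(presence_overhead_kbps(100, 20))                  # 60.0
    print(propagation_delay_s(5, 10))                       # 50
    print(flooding_packets(200) / flooding_packets(100))    # 4.0
```

In this model, going from 100 to 200 nodes doubles the presence overhead, while the flooding count from point 1) quadruples.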
Salut/avahi/meshview issues
I understand that salut is not very popular lately, since we are drifting mostly towards infra mode. Still, it is the preferred way for G1G1 laptops to talk to each other, since there is no SS, and the public jabber server is not guaranteed to be available and may become overcrowded in the future.
I have conducted several tests with a group of 9 XOs blinded with each other. The most important issue is the response of the mesh view when an XO leaves the mesh. The results were:
1. The xmas tree effect is still here, i.e. XOs occasionally vanish and reappear in different positions. This is because of the following: when the avahi cache includes several inactive/departed (reported as failed) peers and a new peer arrives, all the inactive peers vanish from the screen instantly (#5501). If their inactivity was temporary, they reappear shortly afterwards in a different location. For example, if 3-4 XOs are moved simultaneously (by user intervention) from ch6 to ch11 and then back to ch6, the icons won't have time to disappear. BUT the first XO to return to ch6 triggers the bug for the others, which instantly vanish. Shortly after, they all naturally return to ch6 one by one and reappear in different locations. There was a patch for this issue (#5501), included in 678+, but it has no effect.
2. It takes up to 10 min for avahi even to detect that a peer is inactive, i.e. if an XO switches channels, avahi may not notice for up to 10 min (it used to be 1-2 min).
3. It takes a total of about 30 min for the XO to vanish from the mesh view (this is far too long!).
4. Avahi and the mesh view respond independently. It used to be that when an entry disappeared in avahi it disappeared in the mesh view, and likewise when new peers arrived. This relation was very consistent. Now, however, we see the following cases:
a) an XO vanishes from the mesh view but remains indefinitely in the avahi cache as failed to resolve;
b) sometimes avahi shows far fewer peers than the mesh view. The extra peers in the mesh view are definitely active, since they respond properly to activity joining/sharing;
c) sometimes avahi includes more active peers than the mesh view.
Does anyone know why this is happening? Is it a bug? I have logs, if needed, comparing avahi-browse output with timestamped dbus-monitor logs, that show the inconsistencies.
5. An important improvement is that peers generally no longer fail on their own. So if many XOs join a mesh channel and no one goes away, they will not start failing. This used to be a common effect after 4-5 XOs. In fact, I once saw 61 active XOs in the mesh view at 1cc! This shows that salut is more capable than we expected.
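Points 2 and 3 are easier to reason about as a timeline. The Multicast DNS specification (RFC 6762, section 5.2) has a querier re-ask for a record at 80%, 85%, 90% and 95% of its TTL and drop the record when the TTL runs out unanswered; whatever the mesh view adds comes on top of that. A minimal Python sketch, assuming a peer that leaves silently (no goodbye packet) and that avahi follows the spec's schedule; `grace_s` is a hypothetical stand-in for the mesh view's extra delay, not a real avahi or Sugar setting.

```python
# Refresh points from the Multicast DNS spec (RFC 6762, section 5.2); the spec also
# adds 0-2% random jitter to each, which this sketch ignores.
REFRESH_FRACTIONS = (0.80, 0.85, 0.90, 0.95)


def removal_timeline(last_seen_s, ttl_s, grace_s):
    """Times (in seconds) at which a silently departed peer is re-queried, dropped
    from the mDNS cache, and finally removed from the mesh view.

    Assumes the peer answered nothing after last_seen_s and sent no goodbye packet.
    grace_s is a hypothetical extra delay applied by the UI after the cache drops the record.
    """
    refresh_queries = [last_seen_s + f * ttl_s for f in REFRESH_FRACTIONS]
    cache_expiry = last_seen_s + ttl_s
    view_removal = cache_expiry + grace_s
    return refresh_queries, cache_expiry, view_removal


if __name__ == "__main__":
    # TTL of 10 min as in point 2; the 20 min grace is chosen only to match point 3's ~30 min.
    queries, expiry, removal = removal_timeline(0, 600, 1200)
    print([round(q) for q in queries], expiry, removal)
```

With a 10 min TTL the cache cannot drop a silent peer sooner than 10 min after its last answer, whatever the mesh view does on top.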
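The log comparison mentioned in point 4 boils down to a set difference per moment in time. A minimal Python sketch, assuming the avahi-browse and dbus-monitor logs have already been parsed into sets of peer names keyed by matching timestamps (the parsing depends on the log formats, which are not shown here); the function and peer names are hypothetical.

```python
def compare_peer_views(avahi_peers, mesh_view_peers):
    """Split two peer listings into the mismatch cases described in point 4."""
    avahi, mesh = set(avahi_peers), set(mesh_view_peers)
    return {
        "only_in_mesh_view": sorted(mesh - avahi),  # case b)
        "only_in_avahi": sorted(avahi - mesh),      # cases a) and c)
    }


def find_mismatches(avahi_snapshots, mesh_snapshots):
    """Both arguments map a timestamp to the peer names seen at that moment.

    Returns (timestamp, only_in_mesh_view, only_in_avahi) for every shared
    timestamp at which the two views disagree.
    """
    report = []
    for ts in sorted(avahi_snapshots.keys() & mesh_snapshots.keys()):
        diff = compare_peer_views(avahi_snapshots[ts], mesh_snapshots[ts])
        if diff["only_in_mesh_view"] or diff["only_in_avahi"]:
            report.append((ts, diff["only_in_mesh_view"], diff["only_in_avahi"]))
    return report


if __name__ == "__main__":
    avahi = {0: {"xo-a", "xo-b"}, 60: {"xo-a", "xo-b"}}
    mesh = {0: {"xo-a", "xo-b"}, 60: {"xo-a", "xo-c"}}
    print(find_mismatches(avahi, mesh))
```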