Re: Salut/avahi/meshview issues

2008-02-02 Thread Sjoerd Simons
On Thu, Jan 31, 2008 at 07:32:36AM -0500, Giannis Galanis wrote:
 I believe our current salut/avahi issues are described in the following
 points:
 
 1. I was under the impression that when a peer switches channels it sends a
 goodbye signal, and that only peers removed abnormally (after crashes,
 power-offs by pressing the button, etc.) would be slow to disappear from
 mesh views. The 10min TTL is not unreasonable, but it should only be used
 for a routine check. Peers that leave or arrive should inform the mesh
 instantly. In that case the 10min TTL will only affect the mesh points
 with noisy links, whose goodbye signals get lost. And these
 connections are lower priority anyway. We could also send 2 or 3 goodbye
 signals to ensure delivery.

I don't think avahi gets a chance to send goodbye packets. More specifically, I
don't think NM or any other mechanism actually tells avahi "we're going to leave
the network, please say goodbye" and then gives it a chance to actually send the
necessary goodbyes.

 2. We should definitely decrease the timeout window between a lost peer
 being detected and its actual disappearance from the mesh view. This used
 to be 10min, now it is 20min, but really, in my experience, if a peer has
 been away for more than 1-2min it isn't coming back.

In the code it's actually 12 min plus the time it takes avahi to conclude a node
has gone. So this used to be around 14 minutes at most, but with the TTL raised
to 10 min it will be around 22 minutes. It might be interesting to see whether,
with the latest patches, the number of false negatives has gone down so much
that we can remove, or at least decrease, the slack time we add after a node has
gone in avahi.
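
To make the arithmetic above concrete, here is a minimal sketch. It assumes the
12-minute slack and the roughly 2-minute and 10-minute avahi detection times
quoted in this mail; the function and constant names are hypothetical and are
not taken from the salut code.

```python
# Illustrative only: these names are hypothetical, not from salut or avahi.
SLACK_MINUTES = 12  # slack added after avahi reports a node gone (per this mail)


def max_removal_delay(avahi_detect_minutes, slack_minutes=SLACK_MINUTES):
    """Worst-case minutes between a peer leaving and its removal from the view."""
    return avahi_detect_minutes + slack_minutes


if __name__ == "__main__":
    print(max_removal_delay(2))   # earlier detection time -> 14
    print(max_removal_delay(10))  # detection time with the 10 min TTL -> 22
```

Decreasing the slack only changes the second term, so the avahi detection time
remains the floor for how quickly a departed peer can disappear.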

 3. Should we make the above TTL and timeout user-specific, or at least
 configurable? Will there be a problem if two XOs have different TTLs? I would
 assume not. The idea is that it is a waste of our resources to try
 to find the ideal values of TTL and timeout by asking the collabora
 team to fix, and fix again, when we can run the tests here in 1cc and
 find out ourselves what suits us best. Is it easy to implement such a patch?

 4. The 5501 bug(xmas tree effect). This is a very specific bug in the
 protocol, and i believe it will be sorted soon.

This one is fixed right?

 5. Why are avahi/salut/mesh view not communicating well? I hope we will have
 some answers on that as well.

I'm not sure. If salut and the mesh view fail to communicate, the same problem
should show up with gabble.

  Sjoerd
-- 
Consider a spherical bear, in simple harmonic motion...
-- Professor in the UCB physics department
___
Devel mailing list
Devel@lists.laptop.org
http://lists.laptop.org/listinfo/devel


Re: Salut/avahi/meshview issues

2008-02-02 Thread Ricardo Carrano
 I don't think avahi gets a chance to send goodbye packets. More
 specifically i don't think NM or other mechanism actually tell avahi: Oh
 we're going to leave the network, please say goodbye and then give it a
 chance to actually send the necessary goodbyes


Yes. The warning (we're changing the channel) could be there, but the
opportunity (to send goodbye) would probably not, anyway.
So you are probably right. Implementing this would require much effort (for
a small achievement).


Re: Salut/avahi/meshview issues

2008-01-31 Thread Ricardo Carrano
 I believe our current salut/avahi issues are described in the following
 points:

 1. I was under the impression that when a peer switches channels it sends a
 goodbye signal, and that only peers removed abnormally (after crashes,
 power-offs by pressing the button, etc.) would be slow to disappear from
 mesh views. The 10min TTL is not unreasonable, but it should only be used
 for a routine check. Peers that leave or arrive should inform the mesh
 instantly. In that case the 10min TTL will only affect the mesh points
 with noisy links, whose goodbye signals get lost. And these
 connections are lower priority anyway. We could also send 2 or 3 goodbye
 signals to ensure delivery.


Hmm, it seems that some dbus signal, or the corresponding processing by the PS,
is missing. Is there a NM dbus signal when we change channels? This should be
easy to determine.




 2. We should definitely decrease the timeout window between a lost peer
 being detected and its actual disappearance from the mesh view. This used
 to be 10min, now it is 20min, but really, in my experience, if a peer has
 been away for more than 1-2min it isn't coming back.


From what you describe, this does not seem related to the protocol itself,
right? I believe it is important to achieve our goals without making the
protocol more chatty.



 3. Should we make the above TTL and timeout user-specific, or at least
 configurable? Will there be a problem if two XOs have different TTLs? I would
 assume not. The idea is that it is a waste of our resources to try
 to find the ideal values of TTL and timeout by asking the collabora
 team to fix, and fix again, when we can run the tests here in 1cc and
 find out ourselves what suits us best. Is it easy to implement such a patch?


I believe it is useful to have some controls in order to help tune
things. But not all of them need to be translated into user-friendly
controls. I believe your question is how we could change this setting
ourselves. Did I get it right?


Re: Salut/avahi/meshview issues

2008-01-31 Thread Giannis Galanis
  2. It takes up to 10min for avahi even to detect the inactivity of a peer.
  i.e. If an XO switches channels, for up to 10min avahi won't even know (it
  used to be 1-2min).

 Is this with or without the patch from bug #6162? If without, then the time
 it takes avahi to discover it should still be 2 minutes. I'd like to know how
 you test this. Oh and please file a bug, so we can actually track these issues.


Patch 6162, as well as the patch for 5501, is included in 689/690, which I am
testing. So this indeed explains the 10 minutes (actually I just found out
about this bug).


  3. It will take a total of about 30min for the XO to vanish from the mesh
  view (this is tooo long!)

 Again, file a bug. Needed info here is if there is a time difference between
 when avahi marks something as removed, when salut sends out the removed signal
 and when it actually disappears from the mesh view.


This is now filed as 6282, with all dbus-monitor/avahi-browse logs to
compare.
This case is also an example of an avahi/mesh view inconsistency:
icons disappear from the mesh view but remain for about 1h longer in the
avahi cache.
But these details should continue in trac anyway.



  4. Avahi/mesh view respond independently.
  The situation used to be that when an entry disappeared in avahi, it
  disappeared in mesh view, and the same when new peers arrived.
  This relation was very consistent.
  However, now we have the following cases:
  a) an XO will vanish from the mesh view, but remain indefinitely in the
  avahi cache as failed to resolve
  b) sometimes avahi shows a lot fewer peers than the mesh view. The extra
  peers in the mesh view are definitely active since they properly respond to
  activity joining/sharing.
  c) sometimes avahi includes more active peers than the mesh view.
  Does anyone know why this is happening?
  Is it a bug?
  I have logs, if needed, that compare avahi-browse with timestamped
  dbus-monitor logs, that indicate the inconsistencies.

 Well you all list them as undesired behaviour, so i would say they're bugs.



  5. An important improvement is that peers will not generally fail a lot on
  their own.
  So, if many XOs join a mesh channel, and no one goes away, they will not
  start failing. This used to be a common effect after 4-5 XOs. However, i
  noticed once in 1cc, 61 active XOs in the mesh view!

 When you say salut, you actually mean avahi. It would help if you could be
 clear on what you mean :) This improvement is probably caused by the fix in
 #5501.


I mean avahi indeed. In the past these two were very tightly tied to each other.
And I believe that the only direct way to examine salut is by checking the
buddy list in the Analyze activity.
I remember Ricardo had an interesting case where the buddy list included
plenty of XOs, which were also properly sharing in the mesh view, but the
avahi list was empty. Does this seem possible? (unfortunately no log at the
moment)



 Anyway for all the bugs you should have filed instead of sending this mail, i
 will need tcpdump logs, avahi logs, salut logs and if possible meshview logs
 indicating when contacts are removed from the mesh from a machine where you
 see the behaviour. Preferably with timestamps


I updated the trac with logs/tcpdumps/dbusmon/screenshots...enjoy!

The reason I sent this email first, before filing tons of bugs, is that I
thought it was necessary to describe the big picture and the current status
of salut, and also to avoid duplicate bugs, or bugs that are in fact
intentional mods.

This conversation unfortunately drifted towards other issues (wireless
difficulties are a sensitive subject at olpc!), but in fact its purpose was
to pin down some very specific bugs in salut that at this point have nothing
to do with the scalability or robustness of the protocol. When these are
resolved, we can proceed with scalability, about which I am very confident.

I believe our current salut/avahi issues are described in the following
points:

1. I was under the impression that when a peer switches channels it sends a
goodbye signal, and that only peers removed abnormally (after crashes,
power-offs by pressing the button, etc.) would be slow to disappear from
mesh views. The 10min TTL is not unreasonable, but it should only be used
for a routine check. Peers that leave or arrive should inform the mesh
instantly. In that case the 10min TTL will only affect the mesh points
with noisy links, whose goodbye signals get lost. And these
connections are lower priority anyway. We could also send 2 or 3 goodbye
signals to ensure delivery.

2. We should definitely decrease the timeout window between a lost peer
being detected and its actual disappearance from the mesh view. This used
to be 10min, now it is 20min, but really, in my experience, if a peer has
been away for more than 1-2min it isn't coming back.

3. Should we make the above TTL and timeout user-specific, or at least
configurable? Will there be a problem if two

Re: Salut/avahi/meshview issues

2008-01-31 Thread Giannis Galanis
On Jan 31, 2008 10:54 AM, Ricardo Carrano [EMAIL PROTECTED]
wrote:


 I believe our current salut/avahi issues are described in the following
  points:
 
  1. I was under the impression that when a peer switches channels it
  sends a goodbye signal, and that only peers removed abnormally (after
  crashes, power-offs by pressing the button, etc.) would be slow to
  disappear from mesh views. The 10min TTL is not unreasonable, but it should
  only be used for a routine check. Peers that leave or arrive should
  inform the mesh instantly. In that case the 10min TTL will only affect
  the mesh points with noisy links, whose goodbye signals get lost.
  And these connections are lower priority anyway. We could also send 2 or 3
  goodbye signals to ensure delivery.


 Hmm, it seems that some dbus signal, or the corresponding processing by the
 PS, is missing. Is there a NM dbus signal when we change channels? This should
 be easy to determine.

 
It must be very easy for the PS to detect a channel change, or in any case when
the XO leaves the channel. The point is whether avahi supports such
notifications, so the other peers can instantly remove the entry.



  2. We should definitely decrease the timeout window between a lost peer
  being detected and its actual disappearance from the mesh view. This used
  to be 10min, now it is 20min, but really, in my experience, if a peer has
  been away for more than 1-2min it isn't coming back.


 From what you describe, this does not seem related to the protocol itself,
 right? I believe it is important to achieve our goals without making the
 protocol more chatty.

 
This timeout is client specific, and doesn't affect the protocol itself at
all. The reason this timeout exists (to my knowledge anyway) is that
sometimes a peer seems undiscoverable, when in fact it is just the effect of a
poor link, so the peer rejoins shortly after. The effect would be that XOs
move around the mesh view. To solve this issue, we wait for several minutes
before actually removing the XO.
In my opinion, the more we hide from the user, the more she gets confused.
Keeping the icon in the mesh view while the connection is down just messes
things up.
I also remember that there was the idea of keeping the lost icon in the
mesh view, but notifying the user somehow, e.g. by changing its outline to a
dotted line or something. But this is a UI issue.
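
As an illustration of the client-side timeout described above, here is a
minimal sketch of a grace period before removal. It is not the mesh view or PS
code: the class and method names are hypothetical, and time is passed in
explicitly (in seconds) so the behaviour is easy to follow.

```python
class PeerTracker:
    """Keep a lost peer visible for grace_seconds before removing it.

    Hypothetical sketch; not taken from the presence service or mesh view.
    """

    def __init__(self, grace_seconds):
        self.grace_seconds = grace_seconds
        self.visible = set()    # peers currently drawn in the view
        self.lost_since = {}    # peer -> time the loss was first reported

    def peer_seen(self, peer):
        # A (re)announcement cancels any pending removal, so a peer that
        # only dropped out because of a poor link keeps its place.
        self.visible.add(peer)
        self.lost_since.pop(peer, None)

    def peer_lost(self, peer, now):
        # Record only the first loss report; do not remove the peer yet.
        if peer in self.visible:
            self.lost_since.setdefault(peer, now)

    def expire(self, now):
        # Remove peers whose grace period has fully elapsed.
        expired = sorted(p for p, t in self.lost_since.items()
                         if now - t >= self.grace_seconds)
        for peer in expired:
            self.visible.discard(peer)
            del self.lost_since[peer]
        return expired
```

With a grace period of 120 seconds, a peer reported lost at t=0 and seen again
before t=120 is never removed, while one that stays silent is dropped on the
first expire() call at or after t=120. Shortening the grace period trades
icon stability on poor links for a view that reflects departures sooner.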



 
  3. Should we make the above TTL and timeout user-specific, or at least
  configurable? Will there be a problem if two XOs have different TTLs? I
  would assume not. The idea is that it is a waste of our resources
  to try to find the ideal values of TTL and timeout by asking the
  collabora team to fix, and fix again, when we can run the tests here in
  1cc and find out ourselves what suits us best. Is it easy to implement such
  a patch?


 I believe it is useful to have some controls in order to help tune
 things. But not all of them need to be translated into user-friendly
 controls. I believe your question is how we could change this setting
 ourselves. Did I get it right?

Exactly. By no means do these controls need to be user friendly. We only
need the ability to tune them dynamically ourselves for testing and
evaluation purposes.


Re: Salut/avahi/meshview issues

2008-01-30 Thread Sjoerd Simons
On Wed, Jan 30, 2008 at 01:21:29AM -0500, Giannis Galanis wrote:
 The results were:
 
 1.  The xmas tree effect is still here.
 i.e. XOs occasionally vanish/reappear in different positions.
 This is because of the following:
 When the avahi cache includes several inactive/departed (reported as failed)
 peers, and a new peer arrives,
 then all the inactive peers vanish from the screen instantly. (#5501)
 If their inactivity was temporary, then they will reappear shortly in a
 different location.
 If e.g. 3-4 XOs are (by user intervention) moved simultaneously from ch6
 to ch11, and then back to ch6, the icons won't have the time to disappear.
 BUT, the first to return to ch6 will cause the effect/bug to the others,
 which will instantly vanish. Shortly after they will naturally all return
 one by one to ch6 and will reappear in different locations.
 There was a patch for this issue (5501), which was included in 678+, but it
 has no effect.

Please report your findings in the actual bug report so we can track it. And
also, don't expect things to be fixed in reasonable timeframes if we have to
wait more than one month before our changes are actually tested.

 2. It takes up to 10min for avahi even to detect the inactivity of a peer.
 i.e. If an XO switches channels, for up to 10min avahi won't even know (it
 used to be 1-2min).

Is this with or without the patch from bug #6162? If without, then the time it
takes avahi to discover it should still be 2 minutes. I'd like to know how you
test this. Oh and please file a bug, so we can actually track these issues.

 3. It will take a total of about 30min for the XO to vanish from the mesh
 view(this is tooo long!)

Again, file a bug. Needed info here is if there is a time difference between
when avahi marks something as removed, when salut sends out the removed signal 
and when it actually disappears from the mesh view.
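
The three timestamps asked for above can be compared with a few lines of
Python. This is only a sketch: the function name and the example timestamps
are hypothetical, and extracting the timestamps from the avahi-browse,
dbus-monitor and meshview logs is left out.

```python
from datetime import datetime


def removal_lags(avahi_removed, salut_signalled, view_removed):
    """Return the delay in seconds between each stage of a peer's removal."""
    return {
        "avahi_to_salut": (salut_signalled - avahi_removed).total_seconds(),
        "salut_to_view": (view_removed - salut_signalled).total_seconds(),
        "avahi_to_view": (view_removed - avahi_removed).total_seconds(),
    }


if __name__ == "__main__":
    # Made-up timestamps purely to show the call; not taken from any real log.
    lags = removal_lags(
        datetime(2008, 1, 30, 12, 0, 0),
        datetime(2008, 1, 30, 12, 0, 5),
        datetime(2008, 1, 30, 12, 12, 5),
    )
    for stage, seconds in lags.items():
        print(stage, seconds)
```

A large avahi_to_salut value would point at salut, while a large
salut_to_view value would point at the presence service or the mesh view.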

 4. Avahi/mesh view respond independently.
 The situation used to be that when an entry disappeared in avahi, it
 disappeared in mesh view, and the same when new peers arrived.
 This relation was very consistent.
 However, now we have the following cases:
 a) an XO will vanish from the mesh view, but remain indefinitely in the
 avahi cache as failed to resolve
 b) sometimes avahi shows a lot fewer peers than the mesh view. The extra peers
 in the mesh view are definitely active since they properly respond to
 activity joining/sharing.
 c) sometimes avahi includes more active peers than the mesh view.
 Does anyone know why this is happening?
 Is it a bug?
 I have logs, if needed, that compare avahi-browse with timestamped
 dbus-monitor logs, that indicate the inconsistencies.

Well you all list them as undesired behaviour, so i would say they're bugs.

 5. An important improvement is that peers will not generally fail a lot on
 their own.
 So, if many XOs join a mesh channel, and no one goes away, they will not start
 failing. This used to be a common effect after 4-5 XOs. However, i noticed
 once in 1cc, 61 active XOs in the mesh view!

When you say salut, you actually mean avahi. It would help if you could be
clear on what you mean :) This improvement is probably caused by the fix in
#5501.

 This shows that salut is more capable than we expected.

Well both avahi and salut are quite capable. I'm not sure why it has such a bad
reputation with you. Probably because you're only seeing it in a very very
extreme network and because there seems to be a lot of FUD about mdns going
around. MDNS definitely isn't an optimal protocol for a mesh, but most of the
issues in big rollouts are actually caused by the wireless firmware not being
good enough to do actual multicast routing.


Anyway for all the bugs you should have filed instead of sending this mail, i
will need tcpdump logs, avahi logs, salut logs and if possible meshview logs
indicating when contacts are removed from the mesh from a machine where you see
the behaviour. Preferably with timestamps

  Sjoerd
-- 
I am what you will be; I was what you are.


Re: Salut/avahi/meshview issues

2008-01-30 Thread Ricardo Carrano
Sjoerd,

Could you please elaborate on this? What do you mean by "wireless firmware not
being good enough to do actual multicast routing"?

Thanks,
Ricardo Carrano

 Well both avahi and salut are quite capable. I'm not sure why it has such a bad
 reputation with you. Probably because you're only seeing it in a very very
 extreme network and because there seems to be a lot of FUD about mdns going
 around. MDNS definitely isn't an optimal protocol for a mesh, but most of the
 issues in big rollouts are actually caused by the wireless firmware not being
 good enough to do actual multicast routing.


Re: Salut/avahi/meshview issues

2008-01-30 Thread Michail Bletsas
Sjoerd Simons [EMAIL PROTECTED] wrote on 01/30/2008 06:45:48 AM:

 
 Well both avahi and salut are quite capable. I'm not sure why it has such a bad
 reputation with you. Probably because you're only seeing it in a very very
 extreme network and because there seems to be a lot of FUD about mdns going
 around. MDNS definitely isn't an optimal protocol for a mesh, but most of the
 issues in big rollouts are actually caused by the wireless firmware not being
 good enough to do actual multicast routing.
 
 
Have you given any thought to the overhead (just in traffic - let's leave 
memory out of the question right now) required to do what you call actual 
multicast routing in the firmware? We all understand how difficult is 
what we are trying to achieve. The firmware hasn't changed much since you 
started working on this project. So, let's drop the finger pointing and 
try to come up with realistic and implementable solutions.

Yianni does testing, he doesn't care where specifically the problem is, 
all that he wants is to see something that works.

M.




Re: Salut/avahi/meshview issues

2008-01-30 Thread Sjoerd Simons
On Wed, Jan 30, 2008 at 09:27:06AM -0500, Michail Bletsas wrote:
 Sjoerd Simons [EMAIL PROTECTED] wrote on 01/30/2008 06:45:48 AM:
 
  
  Well both avahi and salut are quite capable. I'm not sure why it has such a
  bad reputation with you. Probably because you're only seeing it in a very very
  extreme network and because there seems to be a lot of FUD about mdns going
  around. MDNS definitely isn't an optimal protocol for a mesh, but most of the
  issues in big rollouts are actually caused by the wireless firmware not being
  good enough to do actual multicast routing.
  
  
 Have you given any thought to the overhead (just in traffic - let's leave 
 memory out of the question right now) required to do what you call actual 
 multicast routing in the firmware? 

I did some research into mesh routing protocols before starting the salut muc
work. From the research papers i've seen, proper multicast routing seems
entirely viable. Traffic and memory overhead depend on the exact tradeoffs you
make and the protocols used. So i see no reason why this can't be done on
olpc's mesh network.

 We all understand how difficult is what we are trying to achieve. The
 firmware hasn't changed much since you started working on this project. So,
 let's drop the finger pointing and try to come up with realistic and
 implementable solutions.

As said, from my point of view, proper multicast routing is an entirely
realistic thing.

Note that nobody is claiming MDNS is particularly suited for mesh networks.
Because it's not. The reason why we used it is that it was already used on the
olpc mesh even before salut came along and we just didn't have the resources to
do both a new presence protocol and a MUC protocol. Also note that our muc
protocol uses multicast; the rationale for that was outlined when we originally
proposed telepathy.

Now the exact rationale doesn't matter much. The point is that we've always
been quite clear about the fact that we're heavily using multicast. And nobody
ever claimed that this was a bad/unrealistic thing (at some point there were
even interns at OLPC experimenting with reliable multicast on the mesh, so it
seems that even inside olpc multicast was regarded as a good thing). So we
always (maybe naively) assumed the mesh did/could do proper multicasting.

When we discovered the mesh did not do proper multicasting, we did tell various
people that this was going to be a bad thing. But apparently nobody ever seemed
to think this was a big deal until recently.


So if you now want to say that multicast is not a viable solution and never will
be, because for some reason it's unrealistic with the current mesh hardware,
please make this very very clear. Then at least we can throw more than a year of
development effort out of the window and redo things from scratch.

 Yianni does testing, he doesn't care where specifically the problem is,
 all that he wants is to see something that works.

Well for good testing he should at least have an idea where the problems are
and what the issues involved are :) The scalability problem lies in the current
combination of the mesh implementation and the mdns traffic; how exactly we're
going to solve that is still up for discussion.

  Sjoerd
-- 
Each of us bears his own Hell.
-- Publius Vergilius Maro (Virgil)


Re: Salut/avahi/meshview issues

2008-01-30 Thread Ivan Krstić
On Jan 30, 2008, at 5:12 PM, Michail Bletsas wrote:
 For completely serverless environments, what we have is invaluable. The
 fact that it doesn't scale to large numbers of nodes doesn't make it
 useless.

I'm similarly confused about people's insistence on a rigid dichotomy  
between the approaches. I never regarded our mesh work to be aimed at  
replacing proper infrastructure -- its goal was to provide a viable  
(if degraded) transport when proper infrastructure was prohibitively  
expensive or otherwise not an option. We always knew that this  
approach carried scaling limits, and that's _fine_. As Michail says,  
this by no means makes the system useless.

 We have serious problems making Avahi and even the Jabber server do their
 thing with small numbers of nodes

These two are very different. Avahi is hitting design and network  
limits. With Jabber, the problem is our ugly shared roster hack which  
makes the system do something it's not designed to; this is not an  
issue intrinsic to Jabber.

--
Ivan Krstić [EMAIL PROTECTED] | http://radian.org



Re: Salut/avahi/meshview issues

2008-01-30 Thread Michail Bletsas
Sjoerd Simons [EMAIL PROTECTED] wrote on 01/30/2008 10:46:33 AM:


 
 I did some research into mesh routing protocols before starting the salut muc
 work. From the research papers i've seen, proper multicast routing seems
 entirely viable. Traffic and memory overhead depend on the exact tradeoffs you
 make and the protocols used. So i see no reason why this can't be done on
 olpc's mesh network.

Given ample time, resources, many good programmers and a Turing machine, 
everything is possible.
We have something more than a Turing machine, but we are having serious 
shortages on the other fronts.
The distance from research papers to actual implementation is a great one.

 
  We all understand how difficult is what we are trying to achieve. The
  firmware hasn't changed much since you started working on this project. So,
  let's drop the finger pointing and try to come up with realistic and
  implementable solutions.
 
 As said, from my point of view, proper multicast routing is an entirely
 realistic thing.

No it is not, given the constraints at hand.

 
 Note that nobody is claiming MDNS is particularly suited for mesh networks.
 Because it's not. The reason why we used it is that it was already used on the
 olpc mesh even before salut came along and we just didn't have the resources to
 do both a new presence protocol and a MUC protocol. Also note that our muc
 protocol uses multicast; the rationale for that was outlined when we originally
 proposed telepathy.
 
 Now the exact rationale doesn't matter much. The point is that we've always
 been quite clear about the fact that we're heavily using multicast. And nobody
 ever claimed that this was a bad/unrealistic thing (at some point there were
 even interns at OLPC experimenting with reliable multicast on the mesh, so it
 seems that even inside olpc multicast was regarded as a good thing). So we
 always (maybe naively) assumed the mesh did/could do proper multicasting.
 
 When we discovered the mesh did not do proper multicasting, we did tell various
 people that this was going to be a bad thing. But apparently nobody ever seemed
 to think this was a big deal until recently.

We found that out way before you did, hence the need to be able to
transition from a p2p mDNS approach to the unicast server-based one.
(What we still miss is the intermediate step, i.e. having XOs become
presence servers, aka supernodes, on demand.)
The fact that some people were shocked when they realized that you cannot
cram 500 XOs under one roof and still expect to be passing traffic around
when you rely heavily on basic-rate multicast over mesh is not a reason to
radically rethink everything from scratch.
We discussed how important being able to control the flood would be
very early on, hence the requirement for per-application mesh TTL
settings (so that we can even disable multicast flooding by setting the
TTL to 1 for scenarios like the one in Mongolia). We can always decrease the
contention window if we increase the multicast rate.

For completely serverless environments, what we have is invaluable. The 
fact that it doesn't scale to large numbers of nodes doesn't make it 
useless.


  Yianni does testing, he doesn't care where specifically the problem is,
  all that he wants is to see something that works.
 
 Well for good testing he should at least have an idea where the problems are
 and what the issues involved are :) The scalability problem lies in the current
 combination of the mesh implementation and the mdns traffic; how exactly we're
 going to solve that is still up for discussion.
 
I don't think that the issues that Yanni pointed out are directly related 
to the transport's multicast scalability issues.
We have serious problems making Avahi and even the Jabber server do their 
thing with small numbers of nodes, so let's not blame the transport for 
everything.

M.




Re: Salut/avahi/meshview issues

2008-01-30 Thread Ricardo Carrano
Allow me to offer a perspective.

Last year I went to a trial school in Porto Alegre (southern Brazil).
We grabbed five XOs and went to the housing project where the children live.
There, five kids could use the chat activity from their homes. Everyone was
very excited. The possibilities of the mesh are huge (ok, within limits,
like anything). Let's not forget that no infrastructure is the default
condition around the world (at least for many years to come).

This is not a waste of our time.

On Jan 30, 2008 2:25 PM, Ivan Krstić [EMAIL PROTECTED]
wrote:

 On Jan 30, 2008, at 5:12 PM, Michail Bletsas wrote:
  For completely serverless environments, what we have is invaluable. The
  fact that it doesn't scale to large numbers of nodes doesn't make it
  useless.

 I'm similarly confused about people's insistence on a rigid dichotomy
 between the approaches. I never regarded our mesh work to be aimed at
 replacing proper infrastructure -- its goal was to provide a viable
 (if degraded) transport when proper infrastructure was prohibitively
 expensive or otherwise not an option. We always knew that this
 approach carried scaling limits, and that's _fine_. As Michail says,
 this by no means makes the system useless.

  We have serious problems making Avahi and even the Jabber server do their
  thing with small numbers of nodes

 These two are very different. Avahi is hitting design and network
 limits. With Jabber, the problem is our ugly shared roster hack which
 makes the system do something it's not designed to; this is not an
 issue intrinsic to Jabber.

 --
 Ivan Krstić [EMAIL PROTECTED] | http://radian.org




Re: Salut/avahi/meshview issues

2008-01-30 Thread Ivan Krstić
On Jan 30, 2008, at 6:34 PM, Ricardo Carrano wrote:
 This is not a waste of our time.

Your reply is addressed to me, so I'm not sure whether you understood  
me to be implying that the mesh is a waste of our time. I was trying  
to say exactly the opposite.

--
Ivan Krstić [EMAIL PROTECTED] | http://radian.org



Re: Salut/avahi/meshview issues

2008-01-30 Thread Ricardo Carrano
I got your point.
I apologize if my message was unclear on that.

2008/1/30 Ivan Krstić [EMAIL PROTECTED]:

 On Jan 30, 2008, at 6:34 PM, Ricardo Carrano wrote:
  This is not a waste of our time.

 Your reply is addressed to me, so I'm not sure whether you understood
 me to be implying that the mesh is a waste of our time. I was trying
 to say exactly the opposite.

 --
 Ivan Krstić [EMAIL PROTECTED] | http://radian.org




Re: Salut/avahi/meshview issues

2008-01-30 Thread Sjoerd Simons
On Wed, Jan 30, 2008 at 10:37:24AM -0200, Ricardo Carrano wrote:
 Sjoerd,
 
 Could you please develop this? What do you mean by wireless firmware not
 being good enough to do actual multicast routing.?

Not good enough might be a bit harsh. What I mean is that, as far as I know, it
doesn't implement any mesh multicast routing protocol. MAODV is a well-known
example of a mesh multicast routing protocol, and there are a whole bunch more.
Wikipedia has a whole list[0], and if you look into the research on MANETs
a bit you'll probably find those and a whole lot more.

If it implemented one of those, the point at which the network starts
melting because of mDNS traffic should be a lot higher than it is now,
especially if you have a reasonably dense network, like, say, inside one school.

  Sjoerd
0: http://en.wikipedia.org/wiki/List_of_ad-hoc_routing_protocols#Multicast_Routing
-- 
Each of us bears his own Hell.
-- Publius Vergilius Maro (Virgil)


Re: Salut/avahi/meshview issues

2008-01-30 Thread Ricardo Carrano
Sjoerd,

I know the Wikipedia list. Most of those are papers (never implemented), and
to the best of my knowledge none has been implemented at layer 2.




Re: Salut/avahi/meshview issues

2008-01-30 Thread Polychronis Ypodimatopoulos
Like Michail and Ricardo said, going from a paper publication to an 
actual implementation and also _testing_ of that implementation is a 
very long way. The following factors need to be taken into account when 
comparing various approaches to routing and presence in MANETs:

1) scalability: I would consider broadcasting a special case of 
multicasting, and as such I assume this is an O(n^2) approach (this means 
that, on average, there are n packets in the network for each of n nodes)

2) mobility: Requiring our protocol to be able to handle mobile nodes 
eliminates a good portion of the literature for routing in ad-hoc 
networks. AODV is the most widely adopted algorithm for routing in MANETs.

3) simplicity: This is more important than it sounds. This is the factor 
that allows theory to turn into implementation. Multicasting in the mesh 
network does not scale, but it is relatively simple.
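
The O(n^2) estimate in point 1 can be illustrated with a minimal sketch.
It assumes a naive flooding model (not stated in this thread) in which
every node originates one announcement per round and each node transmits
that announcement exactly once; the function name is hypothetical.

```python
def flood_packets_per_round(n):
    """Transmissions per round under the assumed naive flooding model."""
    # One announcement costs n transmissions: the originator sends it once
    # and each of the other n - 1 nodes rebroadcasts it once.
    per_announcement = n
    # All n nodes originate one announcement per round.
    return n * per_announcement


if __name__ == "__main__":
    for n in (10, 50, 100):
        print(n, flood_packets_per_round(n))
```

Under this assumption, doubling the number of nodes quadruples the
traffic, which is why the approach stops scaling in a dense network.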

My approach provides presence information for 100 nodes with a 
total overhead of 120 Kbps in the worst case (everybody in range of 
each other, as in the school scenario). For 200 nodes, it would have 
an overhead of up to 240 Kbps in the worst case, and so on. Time 
resolution is 10 seconds/hop, so for 5 hops it will take 50 seconds 
for a change to propagate from one side to the other. Doubling the 
time resolution to 20 seconds/hop halves the overhead to 60 Kbps for 
100 nodes, etc.
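
The arithmetic above can be reproduced with a short sketch. It is not
part of the actual implementation: it only assumes, from the figures
quoted, that worst-case overhead grows linearly with the number of nodes
and inversely with the per-hop interval. The function and constant names
are hypothetical.

```python
# Reference point taken from the figures quoted above.
REF_NODES = 100
REF_OVERHEAD_KBPS = 120.0
REF_INTERVAL_S = 10.0


def presence_overhead_kbps(nodes, interval_s=REF_INTERVAL_S):
    """Worst-case overhead, scaled linearly from the reference point (assumed model)."""
    return REF_OVERHEAD_KBPS * nodes / REF_NODES * REF_INTERVAL_S / interval_s


def propagation_delay_s(hops, interval_s=REF_INTERVAL_S):
    """Time for a presence change to cross the given number of hops."""
    return hops * interval_s


if __name__ == "__main__":
    print(presence_overhead_kbps(100))        # 100 nodes at 10 s/hop
    print(presence_overhead_kbps(200))        # 200 nodes at 10 s/hop
    print(presence_overhead_kbps(100, 20.0))  # 100 nodes at 20 s/hop
    print(propagation_delay_s(5))             # 5 hops at 10 s/hop
```

The trade-off is visible directly: a longer interval lowers the overhead
but lengthens the time a change needs to cross the mesh.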

The whole implementation is about 700 lines of Python code, which 
should give a hint about its simplicity. I have implemented both the 
protocol and a simulator that runs multiple instances of the actual 
implementation, just to showcase its actual scalability. The problem is 
that running more than 50 nodes on my 1.8GHz Centrino uses up all 
available processing power and packets start getting dropped.





Salut/avahi/meshview issues

2008-01-29 Thread Giannis Galanis
I understand that salut is not very popular lately, since we are drifting
mostly towards infra mode.
Still, it is the preferable way for G1G1 laptops to talk to each other,
since there is no school server (SS), and the public jabber server is not
guaranteed to be available and may become overcrowded in the future.

I have conducted several tests with a group of 9 XOs blinded with each
other.
The most important issue is the response of the mesh view when an XO
leaves the mesh.

The results were:

1. The xmas tree effect is still here,
i.e. XOs occasionally vanish and reappear in different positions.
This is because of the following:
When the avahi cache includes several inactive/departed/(reported as failed)
peers,
and a new peer arrives,
then all the inactive peers vanish from the screen instantly. (#5501)
If their inactivity was temporary, they will reappear shortly in a
different location.
If, e.g., 3-4 XOs are (by user intervention) moved simultaneously from ch6
to ch11, and then back to ch6, the icons won't have time to disappear.
BUT the first to return to ch6 will trigger the bug for the others,
which will instantly vanish. Shortly after, they will naturally all return
one by one to ch6 and reappear in different locations.
There was a patch for this issue (#5501), which was included in 678+, but it
has no effect.

2. It takes up to 10 min for avahi even to detect the inactivity of a peer,
i.e. if an XO switches channels, avahi won't even know for up to 10 min (it
used to be 1-2 min).

3. It will take a total of about 30 min for the XO to vanish from the mesh
view (this is too long!)

4. Avahi and the mesh view respond independently.
The situation used to be that when an entry disappeared in avahi, it
disappeared in the mesh view, and the same when new peers arrived.
This relation was very consistent.
However, now we have the following cases:
a) an XO will vanish from the mesh view, but remain indefinitely in the
avahi cache as "failed to resolve"
b) sometimes avahi shows a lot fewer peers than the mesh view. The extra peers
in the mesh view are definitely active, since they properly respond to
activity joining/sharing.
c) sometimes avahi includes more active peers than the mesh view.
Does anyone know why this is happening?
Is it a bug?
I have logs, if needed, that compare avahi-browse with timestamped
dbus-monitor logs and show the inconsistencies.

5. An important improvement is that peers will not generally fail a lot on
their own.
So, if many XOs join a mesh channel and no one goes away, they will not start
failing. This used to be a common effect after 4-5 XOs. However, I once
noticed 61 active XOs in the mesh view in 1cc! This shows that salut is more
capable than we expected.
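
The inconsistencies in item 4 (cases a to c) can be spotted by comparing
the two peer lists at a given moment. The following is only a sketch: it
assumes the peer names have already been extracted from avahi-browse
output and from the mesh view (for example out of the dbus-monitor
logs), and the function name and the sample peer names are hypothetical.

```python
def compare_views(avahi_peers, mesh_peers):
    """Split peers into those seen by avahi only, the mesh view only, or both."""
    avahi = set(avahi_peers)
    mesh = set(mesh_peers)
    return {
        "avahi_only": sorted(avahi - mesh),  # cases a) and c)
        "mesh_only": sorted(mesh - avahi),   # case b)
        "both": sorted(avahi & mesh),        # consistent entries
    }


if __name__ == "__main__":
    # Made-up peer names, for illustration only.
    result = compare_views(["xo-1", "xo-2", "xo-3"], ["xo-2", "xo-3", "xo-4"])
    print(result["avahi_only"])  # ['xo-1']
    print(result["mesh_only"])   # ['xo-4']
    print(result["both"])        # ['xo-2', 'xo-3']
```

Running this against snapshots taken at the same timestamps would show
how long each kind of mismatch persists.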