I had some success in trying to reproduce the roaming problems of our WLAN. I used an automated experiment in which the access point's channel is changed and the Neo then tries to ping the access point.
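For illustration, one trial of such an experiment could be sketched in Python roughly as follows. This is not the actual experiment script. `set_channel` and `ping_once` are hypothetical callables standing in for whatever reconfigures the access point and sends one ping from the Neo, and the 0.2 s ping interval is an assumption. The default thresholds mirror the "100 good pings within 120 seconds" criterion used for the statistics in this mail, read here as 100 good pings in total, not necessarily consecutive.

```python
import time
from typing import Callable


def run_trial(set_channel: Callable[[int], None],
              ping_once: Callable[[], bool],
              channel: int,
              good_needed: int = 100,
              timeout_s: float = 120.0,
              interval_s: float = 0.2,
              clock: Callable[[], float] = time.monotonic,
              sleep: Callable[[float], None] = time.sleep) -> bool:
    """Change the AP channel, then ping until enough replies arrive.

    Returns True if `good_needed` good pings were seen before
    `timeout_s` seconds elapsed, False otherwise.
    """
    set_channel(channel)              # hypothetical: reconfigure the AP
    deadline = clock() + timeout_s
    good = 0
    while clock() < deadline:
        if ping_once():               # hypothetical: True on a good reply
            good += 1
            if good >= good_needed:
                return True
        sleep(interval_s)             # assumed pacing between pings
    return False
```

Passing the clock and sleep functions in as parameters keeps the loop testable without real hardware or real waiting.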
It seems that the number one culprit is a bug that causes the firmware to crash.

Scripts to run the experiment and to analyze the data are here:
http://svn.openmoko.org/developers/werner/wlan/freeze/

An example log is here:
http://people.openmoko.org/werner/wlan-freeze/4.bz2

Some statistics of what's in this file:

- 308 frequency changes, of which

  - 244 (79%) happened smoothly, most taking only a few seconds, including the connectivity test. A few took much longer, but that's mainly because some packet got delayed and ping overestimated the RTT.

  - 53 (17%) times, the module returned to the previous frequency, and either happily continued there (thanks to frequency leakage, it seems) or noticed the error after a while and then re-associated, this time getting it right.

  - once it got completely confused and picked a frequency that was neither the old nor the new one, but somehow managed to struggle on

  - in ten cases (3%) something more sinister happened: it failed to return at least 100 good pings within 120 seconds after the frequency change. In these ten cases,

    - two could be recovered (*) by issuing an iwconfig to force re-association

    - one was resolved (*) by doing an "iwlist scan", perhaps reminding the module that it was sitting at the wrong frequency

    - in one mystery case, nothing looked out of place, yet communication stayed dead and only a module reset could fix it. Perhaps waiting a bit longer could have helped. (I haven't analyzed timing and ping performance yet.)

    - in six cases, 2% of all frequency changes, it seems that the firmware crashed :-( This caused the familiar station list with just one station. I sent a register dump to Atheros.

(*) Perhaps the problem would also have disappeared just by waiting, and the recovery action we took was in fact irrelevant. My script also uses "wait and see" as a recovery strategy, but I need a lot more data points before I can tell whether waiting alone is - at least technically - good enough.
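As a cross-check of the arithmetic, the counts above can be tallied in a few lines of Python. This is only a sketch: the dictionary keys are shorthand labels chosen for this example, and the counts are the ones quoted in this mail.

```python
# Outcome counts for the 308 frequency changes quoted above.
total = 308
outcomes = {
    "smooth": 244,
    "returned to previous frequency": 53,
    "picked an unrelated frequency": 1,
    "failed the 100-pings-in-120-s test": 10,
}

# Breakdown of the ten failed cases.
failures = {
    "recovered after iwconfig": 2,
    "recovered after iwlist scan": 1,
    "needed a module reset": 1,
    "firmware crash": 6,
}


def percent(count: int, of: int) -> int:
    """Share of `count` in `of`, rounded to a whole percent."""
    return round(100 * count / of)


# Each share is expressed relative to all 308 frequency changes.
shares = {name: percent(n, total) for name, n in outcomes.items()}
crash_share = percent(failures["firmware crash"], total)
```

The four outcome classes add up to 308 and the four failure classes add up to ten, so every frequency change is counted exactly once.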
Conclusions so far:

- we end up at the wrong frequency a lot more often than I would have expected, and even if we are two channels away from what the access point is using (*), things may not look bad at first sight. This needs more analysis.

  (*) At least I hope it is using the right frequencies. I don't have a spectrum analyzer, so I can't verify what really happens.

- some roaming problems may be caused by frequency leakage making the WLAN module think all is well while in fact it barely manages to get enough communication done to maintain this illusion. This also still needs a quantitative analysis.

- there's a firmware bug that may cause about half of all roaming problems. We could work around this by resetting the module when it trips, but I hope we can find a better solution.

I'm now running a longer experiment that should turn up more data.

- Werner