Got it.  Excellent theory.

I guess my question is: why bother with the whole neighbor-cache thing at
all?  Why not stick with a classic IM architecture where clients just
maintain persistent TCP connections with the back end?  I mean, there are IM
networks *far* bigger than Skype that deal with the same massive-restart
problems all the time.  

So far as I can tell, the only thing unique about Skype versus a classic IM
network is that Skype does a p2p relay service, but even that could be
organized in a centralized fashion far more easily than with some p2p approach.

It'd be interesting to do an analysis on uptime of AIM versus Skype --
recognizing that AIM is 10x bigger.  I don't use Skype enough to really have
an opinion on its uptime, but AIM seems pretty good: flaky on occasion, but
generally quite reliable.

I'd love to see more rigorous numbers on that, especially given that Skype
has been widely celebrated for "better scalability and reliability through
the magic of p2p!"  I wonder if it in fact achieves either of those.

-david

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:p2p-hackers-
> [EMAIL PROTECTED] On Behalf Of Alexander Pevzner
> Sent: Tuesday, August 21, 2007 4:02 PM
> To: [email protected]
> Subject: Re: [p2p-hackers] what really happened to Skype?
> 
> Hi, David,
> 
> Skype was not completely down the whole time. At the start, Skype
> more or less worked, but clients periodically lost their connection to
> the server. Then the situation got worse: the average disconnected
> period grew and the connected period shrank. At its worst, clients
> could connect for only a few seconds a few times a day. Recovery did
> not happen at one happy moment either. Instead, the situation
> gradually reversed: clients were able to log in more often and stay
> online longer.
> 
> I ran 2 Skype clients on different machines at the same time. There was
> absolutely no correlation between when each of them was logged in.
> One of them was able to call the Skype voice test while the other was
> unable to log in.
> 
> This brings me to the conclusion that the server was not completely
> down, but rejected (or was unable to complete in time) most of the
> requests that clients made during the login process.
> 
> Please also note 2 more facts:
>   1) This is not the first time that millions of Windows users rebooted
> their machines due to a Windows upgrade. But this is the first time it
> crashed the Skype network.
>   2) The situation was fixed on the server side. A client upgrade was
> not required.
> 
> These observations, together with Skype's claim that the problem is in
> the p2p algorithm, lead me to the following hypothesis. A Skype client,
> while online, tracks the state of ~20 neighbors. There is a neighbor
> cache on the client, which only needs to be reloaded from the server
> when either the client is new or nobody in the cache is online. This
> operation is probably not cheap for the server. When a critical number
> of clients go offline, too many orphaned clients try to update their
> neighbor cache from the server, which overloads the server. The server
> becomes unable to maintain already established connections, more
> clients leave the network, and so on.
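
A back-of-the-envelope sketch of why this hypothesis would behave like a threshold rather than a gradual slowdown. Everything here is an assumption for illustration: the function names are made up, the cache size of 20 comes from the "~20 neighbors" guess above, and treating each cached neighbor as offline independently with the same probability is a simplification, not anything known about Skype's actual protocol.

```python
def orphan_probability(offline_fraction: float, cache_size: int = 20) -> float:
    """Chance that every cached neighbor is offline (independence assumed)."""
    if not 0.0 <= offline_fraction <= 1.0:
        raise ValueError("offline_fraction must be between 0 and 1")
    return offline_fraction ** cache_size


def expected_reloads(clients: int, offline_fraction: float, cache_size: int = 20) -> float:
    """Expected number of clients that must reload their cache from the server."""
    return clients * orphan_probability(offline_fraction, cache_size)


if __name__ == "__main__":
    # 9 million is the user count mentioned in this thread; the offline
    # fractions are arbitrary sample points.
    for offline in (0.5, 0.8, 0.9, 0.95):
        reloads = expected_reloads(9_000_000, offline)
        print(f"{offline:.0%} offline: {reloads:,.0f} reloads")
```

Under those assumptions the server sees fewer than ten reloads when half the network is offline, but roughly a million when 90% is offline. If the hypothesis is right, that steepness would be consistent with earlier mass reboots passing unnoticed while this one did not.
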
> 
> I guess they did one of the following:
>   1) Optimized the server's algorithms to fix the server-side bottleneck
>   2) Added a server-side limit on how many clients may connect per
> minute, to stabilize the network and reduce server load
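
For what it's worth, option 2 could be as small as a sliding-window admission check. This is only a sketch of the general technique, not a claim about what Skype deployed; `LoginLimiter`, `try_admit`, and the numbers in the usage note are hypothetical names and values.

```python
from collections import deque


class LoginLimiter:
    """Admit at most `limit` logins in any `window`-second span (hypothetical)."""

    def __init__(self, limit: int, window: float = 60.0) -> None:
        self.limit = limit
        self.window = window
        self._admitted = deque()  # timestamps of admitted logins, oldest first

    def try_admit(self, now: float) -> bool:
        """Return True and record the login if there is room in the window."""
        # Forget admissions that have aged out of the window.
        while self._admitted and now - self._admitted[0] >= self.window:
            self._admitted.popleft()
        if len(self._admitted) < self.limit:
            self._admitted.append(now)
            return True
        # Over the limit: the caller rejects the login and the client retries later.
        return False
```

A server would call something like `limiter.try_admit(time.monotonic())` before doing any expensive login work, so rejected clients cost almost nothing and the admitted ones can actually finish logging in.
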
> 
> David Barrett wrote:
> > Reading the Skype blog post, it doesn't sound like the problem was due to
> > lack of central resources, but rather some catastrophic bug in the P2P
> > network itself -- like it was unable to reform the DHT (or whatever) in the
> > wake of a massive churn event.
> >
> > I mean, it shouldn't take *2 days* to log in 9 million users, and unless
> > this was coupled with servers actually suffering hardware malfunction (of
> > which there's no indication), I can't see any reason why it'd take that long
> > to simply deal with a big backlog of authentication requests.
> >
> > Does anyone know much about the Skype P2P/DHT/network algorithm, and can
> > they hypothesize what sort of event could cause it to take so long to get
> > back into operation?
> >
> > Also, was Skype 100% down for 2 days followed by it coming 100% back up
> > (indicative of a central server problem), or was it suffering from varying
> > levels of failure throughout that took a couple days to clear up (suggesting
> > P2P network problems)?
> >
> > -david
> >
> >> -----Original Message-----
> >> From: [EMAIL PROTECTED] [mailto:p2p-hackers-
> >> [EMAIL PROTECTED] On Behalf Of Alen Peacock
> >> Sent: Monday, August 20, 2007 1:39 PM
> >> To: theory and practice of decentralized computer networks
> >> Subject: Re: [p2p-hackers] what really happened to Skype?
> >>
> >> Absolutely true that Skype hasn't given us enough details to figure
> >> out exactly what happened or why, but that doesn't prevent the looser
> >> cannons among us from taking a shot:
> >> http://flud.org/blog/2007/08/20/p2ps-skype-induced-blackeye-or-why-diversity-is-good/
> >>
> >> Alen
> >>
> >>
> >> On 8/20/07, zooko <[EMAIL PROTECTED]> wrote:
> >>> Folks:
> >>>
> >>> This is a fascinating case study, but we don't yet have enough
> >>> information to really learn from it!
> >>>
> >>> http://heartbeat.skype.com/2007/08/what_happened_on_august_16.html
> >> _______________________________________________
> >> p2p-hackers mailing list
> >> [email protected]
> >> http://lists.zooko.com/mailman/listinfo/p2p-hackers