Johan Tibell wrote:
To my knowledge, the data we have about the prevalence of encodings on the
web is accurate. We crawl all the pages we can get our hands on, starting
from some set of seeds and then following all the links. You cannot be sure
that you've reached all web sites, as there might be cliques in the web
graph, but we try our best to get them all. You're unlikely to get a better
estimate anywhere else; I doubt many organizations have the machinery
required to crawl most of the web.
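
Concretely, the crawl Johan describes is just a reachability computation
from the seed set. A minimal Haskell sketch, with a made-up toy link table
standing in for real fetching and link extraction:

import qualified Data.Map as Map
import qualified Data.Set as Set
import Data.Set (Set)

type URL = String

-- A toy link table, standing in for real HTTP fetching and link extraction.
toyWeb :: Map.Map URL [URL]
toyWeb = Map.fromList
  [ ("a", ["b"]), ("b", ["a", "c"]), ("c", []), ("d", ["a"]) ]

outLinks :: URL -> [URL]
outLinks u = Map.findWithDefault [] u toyWeb

-- Breadth-first crawl from a seed set: we only ever discover pages that
-- are reachable, by following links, from the seeds.
crawl :: [URL] -> Set URL
crawl seeds = go (Set.fromList seeds) seeds
  where
    go visited []         = visited
    go visited (url:rest) =
      let new = Set.toList (Set.fromList (outLinks url) `Set.difference` visited)
      in  go (foldr Set.insert visited new) (rest ++ new)

-- crawl ["a"] == Set.fromList ["a","b","c"]; page "d" links *to* "a" but
-- nothing links to it, so no crawl seeded outside "d" will ever find it.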

There was a study on this recently (the "bow-tie" study, I believe). They found that the web graph has four main parts:

* a densely connected core, where you can get from any site to any other
* an "in cone", from which you can reach the core (but which cannot be reached from the core, or it would be part of the core)
* an "out cone", which can be reached from the core (but which cannot reach back into the core, for the same reason)
* and unconnected islands, which neither reach nor are reached from the rest

The surprising part is they found that all four parts are approximately the same size. I forget the exact numbers, but they're all 25+/-5%.
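
In graph terms, the core is the largest strongly connected component, and
the two cones are whatever can reach it and whatever it can reach. A rough
sketch of that decomposition using Data.Graph, on a made-up toy graph
(anything that is neither core nor cone gets lumped in with the islands
here):

import Data.Graph (Graph, Vertex, buildG, scc, reachable, transposeG, vertices)
import Data.Tree (flatten)
import Data.List (maximumBy, (\\))
import Data.Ord (comparing)

-- A toy web: sites 0-2 link in a cycle (the core), 3 links into it,
-- 4 is linked from it, and 5 is an isolated island.
tinyWeb :: Graph
tinyWeb = buildG (0, 5) [(0,1),(1,2),(2,0),(3,0),(2,4)]

-- The "core": the largest strongly connected component.
core :: Graph -> [Vertex]
core g = maximumBy (comparing length) (map flatten (scc g))

-- Decompose a graph into (core, in-cone, out-cone, everything else).
-- "Everything else" lumps the islands together with any tendrils.
bowTie :: Graph -> ([Vertex], [Vertex], [Vertex], [Vertex])
bowTie g = (c, inCone, outCone, rest)
  where
    c       = core g
    rep     = head c                             -- any core representative
    outCone = reachable g rep \\ c               -- reachable from the core
    inCone  = reachable (transposeG g) rep \\ c  -- can reach the core
    rest    = vertices g \\ (c ++ inCone ++ outCone)

main :: IO ()
main = print (bowTie tinyWeb)
-- core [0,1,2] (in some order), in-cone [3], out-cone [4], islands [5]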

This implies that an exhaustive crawl of the web could require up to about 50% of all websites as seeds: every island, plus enough of the in-cone to cover all of its sources (in the worst case, most of it). If we're only interested in a representative sample, then we could get by with fewer. However, that depends a lot on the definition of "representative", and we can't have an accurate definition of "representative" without doing an entire crawl at some point in order to discover the appropriate distributions. Then again, distributions change over time...
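
To make the seed question concrete: a crawl sees exactly the seeds plus
whatever is reachable from them, so seeds inside the core buy you the core
and the out-cone and nothing more. A small sketch, with a made-up eight-site
graph in the bow-tie shape:

import Data.Graph (Graph, Vertex, buildG, reachable, vertices)
import qualified Data.Set as Set

-- Fraction of the sites a crawl would ever see, starting from the given
-- seeds: the seeds themselves plus everything reachable from them.
coverage :: Graph -> [Vertex] -> Double
coverage g seeds =
  let seen  = Set.fromList (concatMap (reachable g) seeds)
      total = length (vertices g)
  in  fromIntegral (Set.size seen) / fromIntegral total

-- Eight sites: 0 and 1 form the core, 2 and 3 can only reach it,
-- 4 and 5 can only be reached from it, 6 and 7 are islands.
eightSites :: Graph
eightSites = buildG (0,7) [(0,1),(1,0),(2,0),(3,1),(0,4),(1,5)]

-- coverage eightSites [0]         == 0.5   -- a core seed gets core + out-cone
-- coverage eightSites [0,2,3]     == 0.75  -- adding the in-cone still misses the islands
-- coverage eightSites [0,2,3,6,7] == 1.0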

Thus, I would guess that Google only has 50~75% of the net: the core, the out-cone, and a fraction of the islands and in-cone.

--
Live well,
~wren
