Re: % of different content types out there on the web

Markus Jelsma Tue, 31 Jan 2012 04:42:47 -0800

We only crawl HTML and PDF files for a lot of ccTLDs, so we only have data on 
those two. We also explicitly filter out all or most unwanted suffixes. 
That said, we do have a lot of suffixes that we have encountered so far.


On Saturday 28 January 2012 03:01:26 Mattmann, Chris A (388J) wrote:
> (sorry for the cross post)
> 
> Hey Guys,
> 
> I'm trying to find a good citation or estimate (if anyone has done one)
> that estimates the breakout (by % or some other metric) of content types
> out there on the web (with a whole web crawl or a meaningful
> representative dataset) that are non HTML.
> 
> Anyone have any ideas about this?
> 
> Thanks!
> 
> Cheers,
> Chris
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: [email protected]
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-- 
Markus Jelsma - CTO - Openindex
