For a lot of ccTLDs we only crawl HTML and PDF files, so we only have data on those two types. We also explicitly filter out all or most unwanted suffixes. We do have a lot of suffixes that we have encountered so far.

On Saturday 28 January 2012 03:01:26 Mattmann, Chris A (388J) wrote:
> (sorry for the cross post)
>
> Hey Guys,
>
> I'm trying to find a good citation or estimate (if anyone has done one)
> that estimates the breakout (by % or some other metric) of content types
> out there out the web (with a whole web crawl or a meaningful
> representative dataset) that are non HTML.
>
> Anyone have any ideas about this?
>
> Thanks!
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

--
Markus Jelsma - CTO - Openindex

