On Tuesday 31 January 2012 15:55:06 Mattmann, Chris A (388J) wrote:
> Hi Markus,
>
> Thanks for the FYI. Any idea at specific %'s for those unwanted suffixes
> compared to the size of the entire corpus?

Unfortunately no, we don't keep a record of those; we just filter them away as soon as we can.

> Cheers,
> Chris
>
> On Jan 31, 2012, at 4:39 AM, Markus Jelsma wrote:
> > We only crawl HTML and PDF files for a lot of cc-TLD's so we only have
> > data on those two. However, we also explicitly filter out all/most
> > unwanted suffixes. We do have a lot of suffixes that we encountered so
> > far.
> >
> > On Saturday 28 January 2012 03:01:26 Mattmann, Chris A (388J) wrote:
> >> (sorry for the cross post)
> >>
> >> Hey Guys,
> >>
> >> I'm trying to find a good citation or estimate (if anyone has done one)
> >> that estimates the breakout (by % or some other metric) of content types
> >> out there on the web (with a whole web crawl or a meaningful
> >> representative dataset) that are non HTML.
> >>
> >> Anyone have any ideas about this?
> >>
> >> Thanks!
> >>
> >> Cheers,
> >> Chris
> >>
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> Chris Mattmann, Ph.D.
> >> Senior Computer Scientist
> >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >> Office: 171-266B, Mailstop: 171-246
> >> Email: [email protected]
> >> WWW: http://sunset.usc.edu/~mattmann/
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> Adjunct Assistant Professor, Computer Science Department
> >> University of Southern California, Los Angeles, CA 90089 USA
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-- 
Markus Jelsma - CTO - Openindex

