Re: % of different content types out there on the web

2012-01-31 Thread Markus Jelsma
We only crawl HTML and PDF files for a lot of ccTLDs, so we only have data on those two. However, we also explicitly filter out all or most unwanted suffixes. We do have a long list of the suffixes we have encountered so far.
On Saturday 28 January 2012 03:01:26 Mattmann, Chris A (388J) wrote: (sorry for

Re: % of different content types out there on the web

2012-01-31 Thread Mattmann, Chris A (388J)
Hi Markus, Thanks for the FYI. Any idea of the specific percentages for those unwanted suffixes compared to the size of the entire corpus? Cheers, Chris
On Jan 31, 2012, at 4:39 AM, Markus Jelsma wrote: We only crawl HTML and PDF files for a lot of ccTLDs, so we only have data on those two.

Re: % of different content types out there on the web

2012-01-31 Thread Markus Jelsma
On Tuesday 31 January 2012 15:55:06 Mattmann, Chris A (388J) wrote: Hi Markus, Thanks for the FYI. Any idea of the specific percentages for those unwanted suffixes compared to the size of the entire corpus?
Unfortunately no, we don't keep a record of those; we just filter them away as soon as we can.
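
For anyone who wants those numbers, a minimal sketch of tallying suffixes at the moment they are filtered might look like the following. This is not the filter described in this thread; the suffix list and the names `UNWANTED_SUFFIXES`, `url_suffix` and `filter_urls` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from urllib.parse import urlparse
import posixpath

# Hypothetical list of unwanted suffixes; a real crawl filter would use its own.
UNWANTED_SUFFIXES = {".jpg", ".png", ".gif", ".css", ".js", ".zip"}


def url_suffix(url):
    """Return the lowercased suffix of the URL path, ignoring query and fragment."""
    path = urlparse(url).path
    return posixpath.splitext(path)[1].lower()


def filter_urls(urls, unwanted=UNWANTED_SUFFIXES):
    """Drop URLs with an unwanted suffix, counting each dropped suffix."""
    kept = []
    dropped = Counter()
    for url in urls:
        suffix = url_suffix(url)
        if suffix in unwanted:
            dropped[suffix] += 1  # record the suffix before discarding the URL
        else:
            kept.append(url)
    return kept, dropped
```

Dividing each count in `dropped` by the total number of URLs seen would give the share of the corpus that each filtered suffix represents.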

Re: % of different content types out there on the web

2012-01-29 Thread Julien Nioche
That could be an interesting experiment to do with the commoncrawl dataset and Tika on Behemoth. Assuming of course that the detection is done correctly by Tika. Does anyone have a spare cluster on EC2 ;-) ? Julien On 28 January 2012 02:01, Mattmann, Chris A (388J)
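
As a rough sketch of the final step of such an experiment, assuming the media types have already been detected for each document (by Tika or any other detector), the percentages could be computed as below. The function name `content_type_percentages` is hypothetical, and the detection step itself is deliberately left out.

```python
from collections import Counter


def content_type_percentages(detected_types):
    """Map each detected media type to its percentage of all documents."""
    counts = Counter(detected_types)
    total = sum(counts.values())
    if total == 0:
        return {}  # no documents, so no distribution to report
    # most_common() orders the result from most to least frequent type
    return {mime: 100.0 * n / total for mime, n in counts.most_common()}
```

On a cluster the counting would be a distributed job rather than one in-memory `Counter`, but the arithmetic stays the same.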

Re: % of different content types out there on the web

2012-01-28 Thread Simão Fontes
Hello Chris, At the Portuguese Web Archive we did a study of web characteristics for the Portuguese web. I don't know if this helps you, but here is the paper: João Miranda, Daniel Gomes, Trends in Web characteristics (best paper award: 2nd place), 7th Latin American Web Congress, Merida,