We only crawl HTML and PDF files for a lot of ccTLDs, so we only have data on
those two. However, we also explicitly filter out most unwanted suffixes.
We do have a long list of suffixes that we have encountered so far.
On Saturday 28 January 2012 03:01:26 Mattmann, Chris A (388J) wrote:
Hi Markus,
Thanks for the FYI. Any idea of specific %'s for those unwanted suffixes
compared to the size of the entire corpus?
Cheers,
Chris
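The suffix filtering Markus describes is typically done in Nutch with the urlfilter-suffix plugin, which reads a list of suffixes to reject from a suffix-urlfilter.txt file. A minimal sketch of what such a list might look like (the entries here are illustrative assumptions, not the actual list from the crawl, and the exact file syntax depends on the Nutch version):

```
# suffix-urlfilter.txt (illustrative sketch)
# reject URLs whose paths end in these suffixes
.gif
.jpg
.png
.css
.js
```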
On Jan 31, 2012, at 4:39 AM, Markus Jelsma wrote:
We only crawl HTML and PDF files for a lot of ccTLDs, so we only have data
on those two.
On Tuesday 31 January 2012 15:55:06 Mattmann, Chris A (388J) wrote:
Hi Markus,
Thanks for the FYI. Any idea of specific %'s for those unwanted suffixes
compared to the size of the entire corpus?
Unfortunately no, we don't keep records of those; we just filter them away as
soon as we can.
That could be an interesting experiment to do with the CommonCrawl dataset
and Tika on Behemoth, assuming of course that the detection is done
correctly by Tika. Does anyone have a spare cluster on EC2 ;-) ?
Julien
On 28 January 2012 02:01, Mattmann, Chris A (388J) wrote:
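The experiment Julien proposes, measuring what fraction of a corpus the unwanted suffixes make up, can be sketched at a small scale in plain Python before committing a cluster to it. The URL sample and the set of unwanted suffixes below are assumptions for illustration only; a real run would stream the CommonCrawl URL index and use the crawl's actual filter list:

```python
from collections import Counter
from urllib.parse import urlparse
import posixpath

# Hypothetical sample of crawled URLs (illustrative only).
urls = [
    "http://example.org/index.html",
    "http://example.org/report.pdf",
    "http://example.org/logo.gif",
    "http://example.org/styles.css",
    "http://example.org/page2.html",
]

# Suffixes a crawl might filter out (an assumed list, not Nutch's default).
unwanted = {".gif", ".jpg", ".png", ".css", ".js"}

def suffix(url):
    """Return the lower-cased file extension of a URL's path, e.g. '.pdf'."""
    return posixpath.splitext(urlparse(url).path)[1].lower()

counts = Counter(suffix(u) for u in urls)
filtered = sum(n for s, n in counts.items() if s in unwanted)
pct = 100.0 * filtered / len(urls)
print(f"unwanted: {filtered}/{len(urls)} = {pct:.1f}%")  # → unwanted: 2/5 = 40.0%
```

Scaling the same tally over the full dataset (with Tika doing real content-type detection rather than suffix guessing) is essentially what the Behemoth job would do.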
Hello Chris,
In the Portuguese Web Archive we did a study of web characteristics
for the Portuguese web. I don't know if this helps you, but here is
the paper.
João Miranda, Daniel Gomes, Trends in Web characteristics (best paper
award: 2nd place), 7th Latin American Web Congress, Merida,