Webmaster wrote:
Hi Otis,
So far so good..
[...]
Thank you for sharing this information with us! This sounds exciting.
When this next round of fetching is done I'm going to inject 10m valid URLs
from my fresh fetch lists and crawl to a depth of 10 to see what happens.
My guess is it will return about 200m URLs; that should be an adequate
stress test of my sad cluster of outdated machines :)
I am, however, still looking into filtering the results for adult content
before I move them off the Hadoop cluster and onto the distributed search
nodes' local file systems.
In my experience a simple word-list-based classifier works well enough, or
a rule-based classifier similar in concept to SpamAssassin; either tends to
catch 80+% of adult pages. You may want to keep those pages in CrawlDb to
prevent their re-discovery; just mark them with something that prevents
indexing and/or generating.
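
To make that concrete, here is a minimal sketch of such a word-list
classifier in Java. The terms, weights, threshold and class name are
made-up placeholders of mine, not anything shipped with Nutch; a real
deployment would load a much larger list from configuration.

import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/**
 * Minimal sketch of a word-list / rule-based adult-content classifier.
 * Terms, weights and the threshold are illustrative placeholders only.
 */
public class WordListAdultClassifier {

    private final Map<String, Integer> weightedTerms = new HashMap<>();
    private final int threshold;

    public WordListAdultClassifier(int threshold) {
        this.threshold = threshold;
        // Placeholder terms, not a real list.
        weightedTerms.put("xxx", 3);
        weightedTerms.put("nsfw", 2);
        weightedTerms.put("explicit", 1);
    }

    /**
     * Returns true once the summed term weights reach the threshold,
     * similar in spirit to SpamAssassin hitting its score cutoff.
     */
    public boolean isAdult(String pageText) {
        int score = 0;
        for (String token : pageText.toLowerCase(Locale.ROOT).split("\\W+")) {
            Integer weight = weightedTerms.get(token);
            if (weight != null) {
                score += weight;
                if (score >= threshold) {
                    return true;
                }
            }
        }
        return false;
    }
}

The marking itself could then be a metadata flag written into the page's
CrawlDb entry, which your generate and index steps check and skip; I'm
leaving that wiring out of the sketch.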
Also, a good strategy for collecting higher-quality pages first is often to
concentrate only on plain-looking URLs, i.e. those without too many strange
characters, all-numerical subdirectories, or too many non-letter characters.
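
Again only as an illustration, here is a rough Java sketch of that URL
heuristic. The cutoffs (rejecting query strings, all-numerical path
segments, and URLs that are more than roughly 30% non-letter characters)
are guesses of mine, not Nutch defaults.

import java.net.URI;
import java.net.URISyntaxException;

/**
 * Rough sketch of a "plain-looking URL" heuristic with made-up cutoffs.
 */
public class PlainUrlHeuristic {

    private static final double MAX_NON_LETTER_RATIO = 0.30;

    public static boolean looksPlain(String url) {
        URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            return false;        // unparsable URLs are not "plain"
        }
        if (uri.getQuery() != null) {
            return false;        // query strings usually mean dynamic pages
        }
        String path = uri.getPath() == null ? "" : uri.getPath();
        for (String segment : path.split("/")) {
            if (!segment.isEmpty() && segment.matches("\\d+")) {
                return false;    // all-numerical subdirectory, e.g. /123456/
            }
        }
        long nonLetters = url.chars().filter(c -> !Character.isLetter(c)).count();
        return url.length() > 0
            && (double) nonLetters / url.length() <= MAX_NON_LETTER_RATIO;
    }

    public static void main(String[] args) {
        System.out.println(looksPlain("http://www.example.com/docs/intro.html")); // true
        System.out.println(looksPlain("http://example.com/p?id=42&sid=9f3a"));    // false
    }
}

In practice the same rules can often be expressed as patterns in
conf/regex-urlfilter.txt instead of custom code.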
--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  || |   Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com