Webmaster wrote:
Hi Otis,
So far so good..
[...]
Thank you for sharing this information with us! This sounds exciting.
When this next round of fetching is done I'm going to inject 10m valid URLs
from my fresh fetch lists and crawl to a depth of 10 to see what happens.
My guess is it will return about 200m URLs; that should be an adequate
stress test of my sad cluster of outdated machines :)
I am, however, still looking into filtering the results for adult content
before I move them off the Hadoop cluster and onto the distributed search
nodes' local file systems.
In my experience a simple word-list-based classifier works well enough, or
a rule-based classifier similar in concept to SpamAssassin; either tends to
catch 80+% of adult pages. You may want to keep those pages in CrawlDb to
prevent their re-discovery; just mark them with something that prevents
indexing and/or generating.
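
To make that concrete, here is a minimal sketch of such a word-list
classifier in Java. The terms, weights, threshold and class name are
made-up placeholders of mine, not anything shipped with Nutch; a real
deployment would load a much larger list from configuration.

import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/**
 * Minimal sketch of a word-list / rule-based adult-content classifier.
 * Terms, weights and the threshold are illustrative placeholders only.
 */
public class WordListAdultClassifier {

    private final Map<String, Integer> weightedTerms = new HashMap<>();
    private final int threshold;

    public WordListAdultClassifier(int threshold) {
        this.threshold = threshold;
        // Placeholder terms, not a real list.
        weightedTerms.put("xxx", 3);
        weightedTerms.put("nsfw", 2);
        weightedTerms.put("explicit", 1);
    }

    /**
     * Returns true once the summed term weights reach the threshold,
     * similar in spirit to SpamAssassin hitting its score cutoff.
     */
    public boolean isAdult(String pageText) {
        int score = 0;
        for (String token : pageText.toLowerCase(Locale.ROOT).split("\\W+")) {
            Integer weight = weightedTerms.get(token);
            if (weight != null) {
                score += weight;
                if (score >= threshold) {
                    return true;
                }
            }
        }
        return false;
    }
}

The marking itself could then be a metadata flag written into the page's
CrawlDb entry, which your generate and index steps check and skip; I'm
leaving that wiring out of the sketch.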
Also, a good strategy for collecting higher-quality pages first is often to
concentrate only on plain-looking URLs, i.e. those without too many strange
characters, all-numerical subdirectories, or too many non-letter characters.
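
Again only as an illustration, here is a rough Java sketch of that URL
heuristic. The cutoffs (rejecting query strings, all-numerical path
segments, and URLs that are more than roughly 30% non-letter characters)
are guesses of mine, not Nutch defaults.

import java.net.URI;
import java.net.URISyntaxException;

/**
 * Rough sketch of a "plain-looking URL" heuristic with made-up cutoffs.
 */
public class PlainUrlHeuristic {

    private static final double MAX_NON_LETTER_RATIO = 0.30;

    public static boolean looksPlain(String url) {
        URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            return false;        // unparsable URLs are not "plain"
        }
        if (uri.getQuery() != null) {
            return false;        // query strings usually mean dynamic pages
        }
        String path = uri.getPath() == null ? "" : uri.getPath();
        for (String segment : path.split("/")) {
            if (!segment.isEmpty() && segment.matches("\\d+")) {
                return false;    // all-numerical subdirectory, e.g. /123456/
            }
        }
        long nonLetters = url.chars().filter(c -> !Character.isLetter(c)).count();
        return url.length() > 0
            && (double) nonLetters / url.length() <= MAX_NON_LETTER_RATIO;
    }

    public static void main(String[] args) {
        System.out.println(looksPlain("http://www.example.com/docs/intro.html")); // true
        System.out.println(looksPlain("http://example.com/p?id=42&sid=9f3a"));    // false
    }
}

In practice the same rules can often be expressed as patterns in
conf/regex-urlfilter.txt instead of custom code.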
--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  || |   Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com