Webmaster wrote:
I'm thinking I do not want any adult content at all in my system. There's
more than enough of that out there on other engines. Preferably, I would
like to use a regex to filter the content while crawling, and then use an
additional filter combination within the search server itself to ensure that
no adult content slips through.
Now, will adding these 1000+ terms to a regex slow down the parsing of the
URLs excessively?
If you want to use regexes - then yes, this will likely cause
significant slowdowns, since a single alternation over 1000+ terms is
expensive to match. However, there are other ways to implement
membership tests, such as prefix trees (tries) and Bloom filters. We
already have a prefix tree implementation in Nutch, and HBase has a
very good implementation of Bloom filters.
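To make the prefix-tree idea concrete, here is a minimal sketch of a trie
used for blocked-term lookups. The class and method names (TermTrie,
containsBlockedTerm) are illustrative only - they are not Nutch's actual
trie API - but they show why this scales better than a 1000-way regex
alternation: each lookup walks only as deep as the characters match.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal prefix-tree (trie) sketch for fast term membership tests.
// Hypothetical names; not Nutch's real implementation.
public class TermTrie {
    private final Map<Character, TermTrie> children = new HashMap<>();
    private boolean terminal; // true if a blocked term ends at this node

    // Insert one blocked term, character by character.
    public void add(String term) {
        TermTrie node = this;
        for (char c : term.toLowerCase().toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new TermTrie());
        }
        node.terminal = true;
    }

    // True if any blocked term occurs as a substring of the text.
    public boolean containsBlockedTerm(String text) {
        String lower = text.toLowerCase();
        for (int i = 0; i < lower.length(); i++) {
            TermTrie node = this;
            for (int j = i; j < lower.length(); j++) {
                node = node.children.get(lower.charAt(j));
                if (node == null) break;      // no term starts this way
                if (node.terminal) return true; // matched a full term
            }
        }
        return false;
    }

    public static void main(String[] args) {
        TermTrie trie = new TermTrie();
        trie.add("badword");
        trie.add("worse");
        System.out.println(trie.containsBlockedTerm("http://example.com/badword/page")); // true
        System.out.println(trie.containsBlockedTerm("http://example.com/clean/page"));   // false
    }
}
```

For very large term lists, an Aho-Corasick automaton (a trie with failure
links) would give linear-time scanning, and a Bloom filter trades a small
false-positive rate for constant-time, constant-memory membership checks.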
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com