Webmaster wrote:
I'm thinking I do not want any adult content at all in my system. There's
more than enough of that out there on other engines. Preferably, I would
like to use a regex to filter the content while crawling, and then use an
additional filter combination within the search server itself to ensure that
no adult content slips through.
Now, will adding these 1000+ terms to a regex slow down the parsing of the
URLs excessively?
If you want to use regexes - then yes, this will likely cause
significant slowdowns, since a single alternation over 1000+ terms is
expensive to match. However, there are other ways to implement
membership tests, such as prefix trees (tries) and Bloom filters. We
already have a prefix tree implementation in Nutch, and HBase has a
very good implementation of Bloom filters.
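To make the prefix-tree idea concrete, here is a minimal sketch of a trie
used for blocked-term lookups. The class and method names (TermTrie,
containsBlockedTerm) are illustrative only - they are not Nutch's actual
trie API - but they show why this scales better than a 1000-way regex
alternation: each lookup walks only as deep as the characters match.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal prefix-tree (trie) sketch for fast term membership tests.
// Hypothetical names; not Nutch's real implementation.
public class TermTrie {
    private final Map<Character, TermTrie> children = new HashMap<>();
    private boolean terminal; // true if a blocked term ends at this node

    // Insert one blocked term, character by character.
    public void add(String term) {
        TermTrie node = this;
        for (char c : term.toLowerCase().toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new TermTrie());
        }
        node.terminal = true;
    }

    // True if any blocked term occurs as a substring of the text.
    public boolean containsBlockedTerm(String text) {
        String lower = text.toLowerCase();
        for (int i = 0; i < lower.length(); i++) {
            TermTrie node = this;
            for (int j = i; j < lower.length(); j++) {
                node = node.children.get(lower.charAt(j));
                if (node == null) break;      // no term starts this way
                if (node.terminal) return true; // matched a full term
            }
        }
        return false;
    }

    public static void main(String[] args) {
        TermTrie trie = new TermTrie();
        trie.add("badword");
        trie.add("worse");
        System.out.println(trie.containsBlockedTerm("http://example.com/badword/page")); // true
        System.out.println(trie.containsBlockedTerm("http://example.com/clean/page"));   // false
    }
}
```

For very large term lists, an Aho-Corasick automaton (a trie with failure
links) would give linear-time scanning, and a Bloom filter trades a small
false-positive rate for constant-time, constant-memory membership checks.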
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com