Webmaster wrote:
I'm thinking I do not want any adult content at all in my system.  There's
more than enough of that out there on other engines.  Preferably I think I
would like to use a regex to filter the content while crawling, and then use
an additional filter combination within the search server itself to ensure
that no adult content gets through.

Now, will adding these 1000+ terms to the regex slow down the parsing of the
URLs excessively?

If you want to use regexes, then yes, matching 1000+ terms that way will likely cause significant slowdowns. However, there are other ways to implement membership tests, such as prefix trees and Bloom filters. We already have a prefix-tree implementation in Nutch, and HBase has a very good implementation of Bloom filters.
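To illustrate the prefix-tree idea, here is a minimal sketch in Java. This is not Nutch's actual implementation; the class and method names (`TermTrie`, `matchesAnywhere`) are illustrative. The trie stores the blocked terms once, and a lookup then walks the tree character by character instead of re-running a 1000-branch regex alternation against every URL:

```java
import java.util.HashMap;
import java.util.Map;

// A minimal prefix tree (trie) for blocked-term lookup. Each node maps a
// character to a child node; "terminal" marks the end of a stored term.
class TermTrie {
    private final Map<Character, TermTrie> children = new HashMap<>();
    private boolean terminal;

    // Insert one blocked term into the trie.
    void add(String term) {
        TermTrie node = this;
        for (char c : term.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new TermTrie());
        }
        node.terminal = true;
    }

    // Return true if any stored term occurs as a substring of text.
    // Cost is O(text length * longest term), independent of how many
    // thousands of terms are stored.
    boolean matchesAnywhere(String text) {
        for (int i = 0; i < text.length(); i++) {
            TermTrie node = this;
            for (int j = i; j < text.length(); j++) {
                node = node.children.get(text.charAt(j));
                if (node == null) break;       // no term continues this way
                if (node.terminal) return true; // a full blocked term matched
            }
        }
        return false;
    }
}

public class TrieDemo {
    public static void main(String[] args) {
        TermTrie trie = new TermTrie();
        trie.add("badterm");
        trie.add("worse");
        System.out.println(trie.matchesAnywhere("http://example.com/badterm/page"));
        System.out.println(trie.matchesAnywhere("http://example.com/clean/page"));
    }
}
```

The same lookup with a Bloom filter would be even cheaper per check (a few hash probes against a bit array), at the cost of a small false-positive rate, which is usually acceptable for a blocklist since a rare clean URL being dropped is harmless here.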


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com