> The goal is to avoid entering 100,000 regex in the crawl-urlfilter.xml
> and checking ALL these regex for each URL. Any comment?
Sure seems like a simple hash lookup table could handle it. I'm having a hard time seeing where you really need a regex and a fixed list wouldn't do. Especially if you have a forward and maybe a backwards lookup as well in a multi-level hash, so you can include/exclude at a certain subdomain level, e.g. include: com->site->good (for good.site.com stuff), exclude: com->site->bad (for bad.site.com), and walk the host labels backwards, much like DNS. Then you'd do a few hash lookups per URL instead of evaluating 100,000 regexes.

I realize I'm talking about host-level rather than page-level filtering, but if you want to include everything from your 100,000 sites, I think such a strategy could work. Hope this makes sense. Maybe I could write some code too and see if it works in practice. If nothing else, maybe the hash stuff could just be another filter option in conf/crawl-urlfilter.txt. See the rough sketch below.

Earl
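For what it's worth, here's a minimal sketch of the idea in Java. The class and method names are purely illustrative (this is not the actual Nutch URLFilter plugin, though filter() mimics its return-the-URL-or-null contract): rules are keyed by reversed host prefixes, and the most specific matching prefix decides include vs. exclude.

    import java.net.URL;
    import java.util.HashMap;
    import java.util.Map;

    /**
     * Sketch of host-level include/exclude filtering via hash lookups on
     * reversed domain names ("com.site.good" style) instead of regexes.
     */
    public class HostHashFilter {

        // Reversed host prefix -> include (true) or exclude (false).
        private final Map<String, Boolean> rules = new HashMap<String, Boolean>();

        public void include(String reversedPrefix) { rules.put(reversedPrefix, Boolean.TRUE); }
        public void exclude(String reversedPrefix) { rules.put(reversedPrefix, Boolean.FALSE); }

        /** Returns the URL if it passes, or null to drop it. */
        public String filter(String urlString) {
            try {
                String host = new URL(urlString).getHost();
                String[] labels = host.split("\\.");

                // Walk from the TLD down: "com", "com.site", "com.site.good", ...
                // A deeper (more specific) rule overrides a shallower one,
                // so a handful of hash lookups per URL decide the outcome.
                StringBuilder key = new StringBuilder();
                Boolean decision = null;
                for (int i = labels.length - 1; i >= 0; i--) {
                    if (key.length() > 0) key.append('.');
                    key.append(labels[i]);
                    Boolean rule = rules.get(key.toString());
                    if (rule != null) decision = rule;
                }
                return Boolean.TRUE.equals(decision) ? urlString : null;
            } catch (Exception e) {
                return null;  // malformed URL: drop it
            }
        }

        public static void main(String[] args) {
            HostHashFilter f = new HostHashFilter();
            f.include("com.site");        // include everything under site.com ...
            f.exclude("com.site.bad");    // ... except bad.site.com

            System.out.println(f.filter("http://good.site.com/page.html")); // passes
            System.out.println(f.filter("http://bad.site.com/page.html"));  // null
            System.out.println(f.filter("http://other.com/"));              // null (no rule)
        }
    }

With 100,000 included sites the rules map just has ~100,000 entries, and each URL costs only as many lookups as its host has labels.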
