> The goal is to avoid entering 100,000 regexes in
> crawl-urlfilter.xml and checking ALL of these
> regexes against each URL. Any comments?

Sure seems like just a hash lookup table could
handle it.  I am having a hard time seeing when you
really need a regex and a fixed list wouldn't do.
Especially if you have a forward and maybe a backwards
lookup as well in a multi-level hash, to perhaps
include/exclude at a certain subdomain level, like

include: com->site->good (for good.site.com stuff)
exclude: com->site->bad (for bad.site.com)

and walk the host name backwards, kind of like DNS.  Then
you could just do a few hash lookups instead of
100,000 regexes.

I realize I am talking about host-level and not page-level
filtering, but if you want to include everything from
your 100,000 sites, I think such a strategy could
work.
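
Roughly, here is the kind of thing I am picturing, as a standalone
Java sketch: nested hash maps keyed by the host labels, walked in
reverse.  (The HostFilter class and its addRule/check methods are
just names made up for illustration, not anything that exists in
Nutch.)

import java.util.HashMap;
import java.util.Map;

public class HostFilter {

    // One node per host label; children keyed by the next label
    // going left (com -> site -> good), DNS-style.
    private static class Node {
        Map<String, Node> children = new HashMap<>();
        Boolean include;          // null = no rule at this level
    }

    private final Node root = new Node();

    // Register a rule, e.g. addRule("good.site.com", true).
    public void addRule(String host, boolean include) {
        String[] parts = host.toLowerCase().split("\\.");
        Node node = root;
        for (int i = parts.length - 1; i >= 0; i--) {
            node = node.children.computeIfAbsent(parts[i], k -> new Node());
        }
        node.include = include;
    }

    // Walk the host backwards; the most specific matching rule wins.
    // Returns true/false if a rule applies, or null if none does.
    public Boolean check(String host) {
        String[] parts = host.toLowerCase().split("\\.");
        Node node = root;
        Boolean verdict = null;
        for (int i = parts.length - 1; i >= 0; i--) {
            node = node.children.get(parts[i]);
            if (node == null) break;
            if (node.include != null) verdict = node.include;
        }
        return verdict;
    }

    public static void main(String[] args) {
        HostFilter f = new HostFilter();
        f.addRule("good.site.com", true);
        f.addRule("bad.site.com", false);
        System.out.println(f.check("good.site.com"));      // true
        System.out.println(f.check("bad.site.com"));       // false
        System.out.println(f.check("other.example.org"));  // null
    }
}

Each URL then costs a handful of hash lookups, independent of how
many sites are in the list.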

Hope this makes sense.  Maybe I could flesh out the sketch above
and see if it works in practice.  If nothing else,
maybe the hash stuff could just be another filter
option in conf/crawl-urlfilter.txt.

Earl
