> The goal is to avoid entering 100,000 regexes in
> crawl-urlfilter.xml and checking ALL of these
> regexes for each URL.  Any comment?

Sure seems like just some hash lookup table could
handle it.  I am having a hard time seeing when you
really need a regex and a fixed list wouldn't do.
Especially if you have a forward and maybe a backward
lookup as well in a multi-level hash, to perhaps
include/exclude at a certain subdomain level, like

include: com->site->good (for good.site.com stuff)
exclude: com->site->bad (for bad.site.com)

and walk the host backwards, kind of like DNS.  Then
you could just do a few hash lookups instead of
100,000 regexes.
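To make the idea concrete, here is a minimal sketch in Java of that reversed-domain walk.  The class name and the rule keys ("com.site.good" and so on) are just made up for illustration; a real version would load its include/exclude entries from a config file instead of hard-coding them.

    import java.util.HashMap;
    import java.util.Map;

    /** Sketch of the reversed-domain hash lookup described above. */
    public class HostPrefixFilter {

        // true = include, false = exclude, keyed by reversed host prefix
        private final Map<String, Boolean> rules = new HashMap<String, Boolean>();

        public void include(String reversedPrefix) { rules.put(reversedPrefix, Boolean.TRUE); }
        public void exclude(String reversedPrefix) { rules.put(reversedPrefix, Boolean.FALSE); }

        /**
         * Walks the host from the TLD down (com -> site -> good), like a
         * DNS lookup in reverse, and returns the decision of the most
         * specific matching rule, or null if no rule matches.
         */
        public Boolean check(String host) {
            String[] labels = host.split("\\.");
            Boolean decision = null;
            StringBuilder key = new StringBuilder();
            for (int i = labels.length - 1; i >= 0; i--) {
                if (key.length() > 0) key.append('.');
                key.append(labels[i]);
                Boolean rule = rules.get(key.toString());
                if (rule != null) decision = rule;   // more specific rules win
            }
            return decision;
        }

        public static void main(String[] args) {
            HostPrefixFilter f = new HostPrefixFilter();
            f.include("com.site.good");   // include good.site.com
            f.exclude("com.site.bad");    // exclude bad.site.com
            System.out.println(f.check("good.site.com")); // true
            System.out.println(f.check("bad.site.com"));  // false
            System.out.println(f.check("other.com"));     // null (no rule)
        }
    }

Each URL then costs only a handful of hash lookups, one per host label, no matter how many rules are loaded.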

I realize I am talking about host-level and not
page-level filtering, but if you want to include
everything from your 100,000 sites, I think such a
strategy could work.

Hope this makes sense.  Maybe I could write some code
and see if it works in practice.  If nothing else,
maybe the hash stuff could just be another filter
option in conf/crawl-urlfilter.txt.
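If it did become a filter option, the wiring might look roughly like the sketch below.  It assumes the usual Nutch URL-filter contract of returning the URL to accept it or null to reject it, and it reuses the HostPrefixFilter class from the earlier sketch; the class name and the hard-coded rules are hypothetical, and a real plugin would implement Nutch's URLFilter interface and read its rules from conf/crawl-urlfilter.txt.

    import java.net.MalformedURLException;
    import java.net.URL;

    /**
     * Hypothetical wrapper showing how the hash lookup could sit behind
     * the accept/reject contract of a Nutch URL filter.
     */
    public class HashHostURLFilter {

        private final HostPrefixFilter hosts = new HostPrefixFilter();

        public HashHostURLFilter() {
            // Hard-coded here just to show the flow; a real plugin would
            // load these entries from a config file.
            hosts.include("com.site.good");   // include good.site.com
            hosts.exclude("com.site.bad");    // exclude bad.site.com
        }

        public String filter(String urlString) {
            try {
                Boolean decision = hosts.check(new URL(urlString).getHost());
                // Only URLs whose host matched an include rule get through.
                return Boolean.TRUE.equals(decision) ? urlString : null;
            } catch (MalformedURLException e) {
                return null;   // unparsable URLs are rejected
            }
        }
    }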

Earl
