Hi all,

This is a note on the order of the regex entries in the crawl URL filter file (crawl-urlfilter.txt).
Make sure that you enter the exclusion ("-") patterns before the inclusion ("+") patterns. RegexURLFilter checks the patterns from top to bottom, and the first pattern that matches a URL takes precedence. So if an inclusion pattern comes first, any exclusion pattern after it that covers the same URLs never takes effect.

For example, if you have entries like these:

  +^http://xyz.com/doc
  -^http://xyz.com/doc/new

then the 'new' exclusion will never take effect, because the doc pattern matches first. Putting the exclusion line above the inclusion line fixes this.

Regards,
Ravi Chintakunta
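To make the ordering rule concrete, here is a small Python sketch that imitates the top-to-bottom, first-match-wins behaviour described above. It is not Nutch code: the function name first_match_filter and the sample URLs are made up for illustration, and rejecting a URL that matches no rule is an assumption of this sketch.

```python
import re


def first_match_filter(rules, url):
    """Return True if the URL is accepted, False otherwise.

    Each rule is '+' or '-' followed by a regex. Rules are tried from
    top to bottom and the first one whose regex is found in the URL
    decides the outcome.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    # Assumption for this sketch: a URL that matches no rule is rejected.
    return False


# Inclusion first: the exclusion below it is never reached.
wrong_order = [
    r"+^http://xyz.com/doc",
    r"-^http://xyz.com/doc/new",
]

# Exclusion first: the narrower pattern gets a chance to match.
right_order = [
    r"-^http://xyz.com/doc/new",
    r"+^http://xyz.com/doc",
]

url = "http://xyz.com/doc/new/page.html"
print(first_match_filter(wrong_order, url))  # True: accepted by the '+' rule
print(first_match_filter(right_order, url))  # False: rejected by the '-' rule
```

With the exclusion listed first, other URLs under http://xyz.com/doc are still accepted by the second rule, which is usually what is intended.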
