I'm working with Bryan Woliner on a tool to prune indexed pages whose URLs match regular expressions. Since I'm relatively new to Nutch, I was wondering if I could get some feedback from you more experienced users on my ideas so far.
I've been looking through the source of PruneIndexTool and the urlfilter-regex plugin, and it looks like I'll be able to reuse a good bit of that code as a starting point. PruneIndexTool defines a PruneChecker interface for objects that examine indexed pages and decide, by their own criteria, whether each page should be pruned. My plan is to write a RegexUrlChecker that implements this interface and keeps or prunes pages based on whether their URLs match rules read from a regex file. It also looks like I could use a pair of static methods from the urlfilter-regex plugin to read a regex file in the same format as regex-urlfilter.txt and return a list of regex rules.

Here's the basic idea of what I'm thinking, in Java-ish pseudocode:

    import java.io.IOException;
    import java.io.Reader;
    import java.util.Iterator;
    import java.util.List;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Query;
    import org.apache.oro.text.regex.PatternMatcher;
    import org.apache.oro.text.regex.Perl5Matcher;

    public class RegexUrlChecker implements PruneChecker {

        // regex Rules (pattern plus +/- sign, as in the urlfilter-regex
        // plugin) to check the pages' urls against
        private List rules;
        private PatternMatcher matcher = new Perl5Matcher();

        public RegexUrlChecker(Reader regexFile) throws IOException {
            // same file format as regex-urlfilter.txt
            rules = RegexURLFilter.readConfigurationFile(regexFile);
        }

        public boolean isPrunable(Query q, IndexReader reader, int docNum)
                throws Exception {
            Document doc = reader.document(docNum);
            String url = doc.get("url");

            // Check the url against the rules: a '+' rule means keep the
            // page (not prunable), a '-' rule means prune it.
            Iterator i = rules.iterator();
            while (i.hasNext()) {
                Rule r = (Rule) i.next();
                if (matcher.contains(url, r.pattern)) {
                    return !r.sign;
                }
            }
            return true; // didn't match any of the rules, so prune it
        }

        public void close() {
            // whatever cleanup we need here
        }
    }

It seems like the best way to use this would be to create an instance of PruneIndexTool, passing our RegexUrlChecker in as the single element of the PruneChecker array and using the Lucene query "url:http" to select all of the indexed pages.

What do you all think of this approach? Any suggestions or comments?

Thanks!

--
Thomas Mayfield
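P.S. For concreteness, here's a rough sketch of the driver I have in mind. The Lucene query-parsing part should be right, but I'm going from memory on PruneIndexTool's constructor and run() method, so treat that wiring (the index dirs, queries, checkers, and the two boolean flags) as a guess until I've checked it against the source:

    import java.io.File;
    import java.io.FileReader;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    public class PruneByRegex {
        public static void main(String[] args) throws Exception {
            // args[0]: regex rules file, same format as regex-urlfilter.txt,
            // e.g. "+^http://www\.example\.com/" to keep, "-\.gif$" to prune
            PruneChecker checker = new RegexUrlChecker(new FileReader(args[0]));

            // "url:http" should match every indexed page, since every url
            // in the index contains "http" in its url field.
            Query query = QueryParser.parse("url:http", "url", new StandardAnalyzer());

            // Constructor arguments below are my guess at the PruneIndexTool API.
            PruneIndexTool tool = new PruneIndexTool(
                new File[] { new File(args[1]) },   // index directory (args[1])
                new Query[] { query },              // queries selecting candidates
                new PruneChecker[] { checker },     // our single checker
                true,                               // unlock the index? (guess)
                true);                              // dry-run? (guess)
            tool.run();
        }
    }

If the dry-run flag works the way I expect, we could run it once to see which urls would be pruned before actually touching the index.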