I'm working with Bryan Woliner on developing a tool to prune indexed
pages by matching their URLs against regular expressions.  Since I'm
relatively new to Nutch, I was wondering if I could get some feedback
from you more experienced users on my ideas so far.

I've been looking through the source of PruneIndexTool and the
urlfilter-regex plugin, and it seems like I'll be able to use a good bit
of the code in there as a starting point.  PruneIndexTool defines an
interface called PruneChecker; an implementation examines indexed pages
and decides whether or not they should be pruned according to its own
criteria.  I was thinking of creating a RegexUrlChecker that implements
this interface and keeps or rejects pages based on whether their URLs
match rules read from a regex file.  It looks like I could use a pair of
static functions defined in the urlfilter-regex plugin to read in a
regex file in the same format as regex-urlfilter.txt and return a list
of regex rules.  Here's the basic idea of what I'm thinking, in Java-ish
pseudocode:

import java.io.IOException;
import java.io.Reader;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Query;
import org.apache.oro.text.regex.PatternMatcher;
import org.apache.oro.text.regex.Perl5Matcher;

public class RegexUrlChecker implements PruneChecker {

    private PatternMatcher matcher = new Perl5Matcher();
    private List rules;  // list of regex Rules against which to check
                         // the pages' urls

    // assumes readConfigurationFile() and the Rule class are made
    // accessible outside of RegexURLFilter
    public RegexUrlChecker(Reader regexFile) throws IOException {
        rules = RegexURLFilter.readConfigurationFile(regexFile);
    }

    public boolean isPrunable(Query q, IndexReader reader, int docNum)
            throws Exception {
        Document doc = reader.document(docNum);
        String url = doc.get("url");

        // check the url against the rules; the first match wins
        Iterator i = rules.iterator();
        while (i.hasNext()) {
            Rule r = (Rule) i.next();
            if (matcher.contains(url, r.pattern)) {
                // following regex-urlfilter.txt semantics: a '+' rule
                // means keep the page, a '-' rule means prune it
                return !r.sign;
            }
        }
        return true; // didn't match any of the rules, so prune it
    }

    public void close() {
        // whatever cleanup we need here
    }
}
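
For reference, a rules file in the same format as regex-urlfilter.txt
might look something like this ('#' lines are comments, and each rule is
a '+' or '-' followed by a regex; the domain here is just a placeholder):

# keep anything on our own site...
+^http://([a-z0-9]*\.)*ourdomain\.com/
# ...and prune everything else
-.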

It seems like the best way to use this would be to create an instance of
PruneIndexTool, passing in our RegexUrlChecker as the single element of
the PruneChecker array and using the Lucene query "url:http" to match
all of the indexed URLs.  What do you all think of this approach?
Any suggestions or comments?  Thanks!
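
In the same Java-ish pseudocode, the driver might look roughly like this
(I haven't pinned down PruneIndexTool's actual constructor or run()
signatures yet, so everything touching PruneIndexTool below is a guess
that needs checking against the source; the file names are placeholders
too):

import java.io.File;
import java.io.FileReader;

public class PruneByRegex {
    public static void main(String[] args) throws Exception {
        // our RegexUrlChecker is the single element of the checker array
        PruneChecker[] checkers = {
            new RegexUrlChecker(new FileReader("regex-prune.txt"))
        };

        // ASSUMED API: a constructor taking the index dir(s) and the
        // checkers, and a run() taking the queries to select candidates
        PruneIndexTool tool = new PruneIndexTool(
            new File[] { new File("index") }, checkers);

        // "url:http" should match every indexed page, so each url gets
        // run past the checker
        tool.run(new String[] { "url:http" });
    }
}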

-- Thomas Mayfield

