Hello,
Attached is a tool to prune indexed segments of unwanted content. To be precise, only the segment indexes are pruned - the segment content itself stays on disk, it just no longer shows up in query results.
This tool helps when you end up with unwanted content in your segments and would rather remove it directly from the index than change the regex URL filters and re-fetch all segments... A similar problem exists with unwanted pages and links in your WebDB - a companion tool with the unimaginative name PruneDBTool is coming soon.
You need to put some queries in a text file that match the kind of content you want to remove. IMPORTANT: these queries must be specified in the Lucene QueryParser syntax, which is NOT the same as the Nutch query syntax.
Also, there are some tricky issues with queries on the "url" field, because of the special way it is split into tokens... E.g. the URL "http://www.cnn.com" is split into the following tokens:
http http-www www cnn com
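Phrase queries like url:"www cnn com" match a consecutive run of these tokens. If you want to double-check what a query parses to before running the tool, here is a minimal sketch using the same classes the tool uses - QueryParser with a WhitespaceAnalyzer and "url" as the default field (the class name is made up for illustration):

  import org.apache.lucene.analysis.WhitespaceAnalyzer;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.Query;

  public class QueryParseCheck {
    public static void main(String[] args) throws Exception {
      // same setup as PruneIndexTool.parseQueries()
      QueryParser qp = new QueryParser("url", new WhitespaceAnalyzer());
      Query q = qp.parse("url:\"www cnn com\"");
      System.out.println(q.toString()); // prints the parsed phrase query
    }
  }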
As an example, you could put the following in your queries file (comments are allowed):
------------------------ queries.txt -------------------
# delete docs from www.cnn.com
url:"www cnn com"

# delete docs that contain "p0rn" in their content,
# but not "study" or "research", and which come from www.cnn.com
content:p0rn -content:(study research) +url:"www cnn com"

# delete docs in Swahili language
lang:sw
---------------------------------------------------------
Then you would execute it like this (this is a dry run; for the real run omit -dryrun, and also omit -showfields for performance reasons):
PruneIndexTool index -queries queries.txt -dryrun -showfields url,title
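That is, the real run would simply be:

PruneIndexTool index -queries queries.txt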
Please test and let me know if it works for you. If you find it useful, I'll add it to the tools package.
--
Best regards,
Andrzej Bialecki
-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)
/*
 * Created on Nov 2, 2004
 * Author: Andrzej Bialecki <[EMAIL PROTECTED]>
 *
 */
package net.nutch.tools;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileFilter;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.io.PrintWriter;
import java.io.Reader;
import java.util.BitSet;
import java.util.List;
import java.util.ArrayList;
import java.util.StringTokenizer;
import java.util.Vector;
import java.util.logging.Logger;

import net.nutch.indexer.IndexSegment;
import net.nutch.util.LogFormatter;
import net.nutch.util.NutchConf;

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

/**
 * This tool prunes existing Nutch indexes of unwanted content. The main method
 * accepts a list of segment directories (containing indexes). These indexes will
 * be pruned of any content that matches a list of Lucene queries read from a file
 * (defined in standard config file, or explicitly overridden from command-line).
 *
 * <p>NOTE 1: Queries are expressed in Lucene's QueryParser syntax, so a knowledge
 * of available document fields is required. This can be obtained by reading sources
 * of <code>index-basic</code> and <code>index-more</code> plugins, or using tools
 * like <a href="http://www.getopt.org/luke">Luke</a>. During query parsing a
 * WhitespaceAnalyzer is used - this choice has been made to minimize side effects of
 * Analyzer on the final set of query terms. However, this means that to query for
 * a URL the query should look like this:
 * <blockquote>
 * <code>url:"http www getopt org"</code>
 * </blockquote>
 * If additional level of control is required, an instance of {@link PruneChecker} can
 * be provided to check each document before it's deleted.
 * </p>
 * <p>NOTE 2: This tool removes matching documents ONLY from segment indexes (or
 * from a merged index). In particular it does NOT remove the pages and links
 * from WebDB. This means that unwanted URLs may pop up again when new segments
 * are created. To prevent this, use RegexURLFilter expressions, or PruneDBTool.</p>
 * <p>NOTE 3: This tool uses a low-level Lucene interface to collect all matching
 * documents. For large indexes this may result in high memory consumption.</p>
 *
 * @author Andrzej Bialecki <[EMAIL PROTECTED]>
 */
public class PruneIndexTool {
  public static final Logger LOG = LogFormatter.getLogger("net.nutch.tools.PruneIndexTool");
  public static int LOG_STEP = 50000;

  /**
   * This interface can be used to implement additional checking on matching
   * documents.
   * @author Andrzej Bialecki <[EMAIL PROTECTED]>
   */
  public static interface PruneChecker {
    /**
     * Check whether this document should be pruned.
     * @param reader index reader to read documents from
     * @param docNum document ID
     * @return true if the document should be deleted, false otherwise.
     */
    public boolean isPrunable(IndexReader reader, int docNum) throws Exception;
  }

  /**
   * This checker's main function is to just print out
   * selected field values from each document.
   *
   * @author Andrzej Bialecki <[EMAIL PROTECTED]>
   */
  public static class PrintFieldsChecker implements PruneChecker {
    private PrintStream ps = null;
    private String[] fields = null;

    public PrintFieldsChecker(PrintStream ps, String[] fields) {
      this.ps = ps;
      this.fields = fields;
    }

    public boolean isPrunable(IndexReader reader, int docNum) throws Exception {
      Document doc = reader.document(docNum);
      StringBuffer sb = new StringBuffer("#" + docNum + ":");
      for (int i = 0; i < fields.length; i++) {
        String[] values = doc.getValues(fields[i]);
        sb.append(" " + fields[i] + "=");
        if (values != null) {
          for (int k = 0; k < values.length; k++) {
            sb.append("[" + values[k] + "]");
          }
        } else sb.append("[null]");
      }
      ps.println(sb.toString());
      return true;
    }
  }

  private Query[] queries = null;
  private IndexReader reader = null;
  private IndexSearcher searcher = null;
  private PruneChecker checker = null;
  private boolean dryrun = false;

  /**
   * Create an instance of the tool, and open all input indexes.
   * @param indexDirs directories with input indexes. At least one valid index must
   * exist, otherwise an Exception is thrown.
   * @param queries pruning queries. Each query will be processed in turn, and the
   * length of the array must be at least one, otherwise an Exception is thrown.
   * @param checker if not null, this instance will be used to perform additional
   * checks on matching documents.
   * @param unlock if true, and if any of the input indexes is locked, forcibly
   * unlock it. Use with care, only when you are sure that other processes don't
   * modify the index at the same time.
   * @param dryrun if set to true, only log actions to be performed, but don't change
   * the index. If false, perform all actions, changing indexes as needed.
   * @throws Exception
   */
  public PruneIndexTool(File[] indexDirs, Query[] queries, PruneChecker checker,
          boolean unlock, boolean dryrun) throws Exception {
    if (indexDirs == null || queries == null) throw new Exception("Invalid arguments.");
    if (indexDirs.length == 0 || queries.length == 0) throw new Exception("Nothing to do.");
    this.queries = queries;
    this.checker = checker;
    this.dryrun = dryrun;
    int numIdx = 0;
    if (indexDirs.length == 1) {
      Directory dir = FSDirectory.getDirectory(indexDirs[0], false);
      if (IndexReader.isLocked(dir)) {
        if (!unlock) {
          throw new Exception("Index " + indexDirs[0] + " is locked.");
        }
        if (!dryrun) {
          IndexReader.unlock(dir);
          LOG.info(" - had to unlock index in " + dir);
        }
      }
      reader = IndexReader.open(dir);
      numIdx = 1;
    } else {
      Directory dir;
      Vector indexes = new Vector(indexDirs.length);
      for (int i = 0; i < indexDirs.length; i++) {
        try {
          dir = FSDirectory.getDirectory(indexDirs[i], false);
          if (IndexReader.isLocked(dir)) {
            if (!unlock) {
              LOG.warning("Index " + indexDirs[i] + " is locked. Skipping...");
              continue;
            }
            if (!dryrun) {
              IndexReader.unlock(dir);
              LOG.info(" - had to unlock index in " + dir);
            }
          }
          IndexReader r = IndexReader.open(dir);
          indexes.add(r);
          numIdx++;
        } catch (Exception e) {
          LOG.warning("Invalid index in " + indexDirs[i] + " - skipping...");
        }
      }
      if (indexes.size() == 0) throw new Exception("No input indexes.");
      IndexReader[] readers = (IndexReader[])indexes.toArray(new IndexReader[0]);
      reader = new MultiReader(readers);
    }
    LOG.info((dryrun ? " [DRY RUN] " : "") + "Opened " + numIdx + " index(es) with total "
            + reader.numDocs() + " documents.");
    searcher = new IndexSearcher(reader);
  }

  /**
   * This class collects all matching document IDs in a BitSet.
   * <p>NOTE: the reason to use this API is that the most common way of
   * performing Lucene queries (Searcher.search(Query)::Hits) does NOT
   * return all matching documents, because it skips very low scoring hits.</p>
   *
   * @author Andrzej Bialecki <[EMAIL PROTECTED]>
   */
  private static class AllHitsCollector extends HitCollector {
    private BitSet bits;

    public AllHitsCollector(BitSet bits) {
      this.bits = bits;
    }

    public void collect(int doc, float score) {
      bits.set(doc);
    }
  }

  /**
   * For each query, find all matching documents and delete them from all input
   * indexes. Optionally, an additional check can be performed.
   */
  public void run() {
    BitSet bits = new BitSet(reader.maxDoc());
    AllHitsCollector ahc = new AllHitsCollector(bits);
    for (int i = 0; i < queries.length; i++) {
      LOG.info((dryrun ? " [DRY RUN] " : "") + "Processing query: " + queries[i].toString());
      bits.clear();
      try {
        searcher.search(queries[i], ahc);
      } catch (IOException e) {
        LOG.warning((dryrun ? " [DRY RUN]" : "") + " - failed: " + e.getMessage());
        continue;
      }
      if (bits.cardinality() == 0) {
        LOG.info((dryrun ? " [DRY RUN]" : "") + " - no matching documents.");
        continue;
      }
      LOG.info((dryrun ? " [DRY RUN]" : "") + " - found " + bits.cardinality() + " document(s).");
      // Now delete all matching documents
      int docNum = -1, start = 0, cnt = 0;
      // probably faster than looping through all indexes?
      while ((docNum = bits.nextSetBit(start)) != -1) {
        try {
          if (checker == null || checker.isPrunable(reader, docNum)) {
            if (!dryrun) reader.delete(docNum);
            cnt++;
          }
        } catch (Exception e) {
          LOG.warning((dryrun ? " [DRY RUN]" : "") + " - failed to delete doc #" + docNum);
        }
        start = docNum + 1;
      }
      LOG.info((dryrun ? " [DRY RUN]" : "") + " - deleted " + cnt + " document(s).");
    }
    try {
      reader.close();
    } catch (IOException e) {
      LOG.warning("Exception when closing reader(s): " + e.getMessage());
    }
  }

  public static void main(String[] args) throws Exception {
    if (args.length == 0) {
      usage();
      LOG.severe("Missing arguments");
      return;
    }
    File idx = new File(args[0]);
    if (!idx.isDirectory()) {
      usage();
      LOG.severe("Not a directory: " + idx);
      return;
    }
    Vector paths = new Vector();
    if (IndexReader.indexExists(idx)) {
      paths.add(idx);
    } else {
      // try and see if there are segments inside, with index dirs
      File[] dirs = idx.listFiles(new FileFilter() {
        public boolean accept(File f) {
          return f.isDirectory();
        }
      });
      if (dirs == null || dirs.length == 0) {
        usage();
        LOG.severe("No indexes in " + idx);
        return;
      }
      for (int i = 0; i < dirs.length; i++) {
        File sidx = new File(dirs[i], "index");
        if (sidx.exists() && sidx.isDirectory() && IndexReader.indexExists(sidx)) {
          paths.add(sidx);
        }
      }
      if (paths.size() == 0) {
        usage();
        LOG.severe("No indexes in " + idx + " or its subdirs.");
        return;
      }
    }
    File[] indexes = (File[])paths.toArray(new File[0]);
    boolean force = false;
    boolean dryrun = false;
    String qPath = null;
    String fList = null;
    for (int i = 1; i < args.length; i++) {
      if (args[i].equals("-force")) {
        force = true;
      } else if (args[i].equals("-queries")) {
        qPath = args[++i];
      } else if (args[i].equals("-showfields")) {
        fList = args[++i];
      } else if (args[i].equals("-dryrun")) {
        dryrun = true;
      } else {
        usage();
        LOG.severe("Unrecognized option: " + args[i]);
        return;
      }
    }
    PruneChecker pc = null;
    if (fList != null) {
      StringTokenizer st = new StringTokenizer(fList, ",");
      Vector tokens = new Vector();
      while (st.hasMoreTokens()) tokens.add(st.nextToken());
      String[] fields = (String[])tokens.toArray(new String[0]);
      pc = new PrintFieldsChecker(System.out, fields);
    }
    Query[] queries = null;
    InputStream is = null;
    if (qPath != null) {
      is = new FileInputStream(qPath);
    } else {
      qPath = NutchConf.get("prune.index.tool.queries");
      is = NutchConf.getConfResourceAsInputStream(qPath);
    }
    if (is == null) {
      LOG.severe("Can't load queries from " + qPath);
      return;
    }
    try {
      queries = parseQueries(is);
    } catch (Exception e) {
      LOG.severe("Error parsing queries: " + e.getMessage());
      return;
    }
    try {
      PruneIndexTool pit = new PruneIndexTool(indexes, queries, pc, force, dryrun);
      pit.run();
    } catch (Exception e) {
      LOG.severe("Error running PruneIndexTool: " + e.getMessage());
      return;
    }
  }

  public static Query[] parseQueries(InputStream is) throws Exception {
    BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"));
    String line = null;
    QueryParser qp = new QueryParser("url", new WhitespaceAnalyzer());
    Vector queries = new Vector();
    while ((line = br.readLine()) != null) {
      line = line.trim();
      // skip blanks and comments
      if (line.length() == 0 || line.charAt(0) == '#') continue;
      Query q = qp.parse(line);
      queries.add(q);
    }
    return (Query[])queries.toArray(new Query[0]);
  }

  private static void usage() {
    System.err.println("PruneIndexTool <indexDir | segmentsDir> [-dryrun] [-force] [-queries filename] [-showfields field1,field2,field3...]");
    System.err.println("\tNOTE: exactly one of <indexDir> or <segmentsDir> MUST be provided!\n");
    System.err.println("\t-dryrun\t\t\tdon't do anything, just show what would be done.");
    System.err.println("\t-force\t\t\tforce index unlock, if locked. Use with caution!");
    System.err.println("\t-queries filename\tread pruning queries from this file, instead of the");
    System.err.println("\t\t\t\tdefault defined in Nutch config files under 'prune.index.tool.queries' key.\n");
    System.err.println("\t-showfields field1,field2...\tfor each deleted document show the values of the selected fields.");
    System.err.println("\t\t\t\tNOTE 1: this will slow down processing by orders of magnitude.");
    System.err.println("\t\t\t\tNOTE 2: only values of stored fields will be shown.");
  }
}
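If you need the "additional level of control" mentioned in the javadoc, a custom PruneChecker could look like the sketch below - a hypothetical checker that vetoes deletion of documents whose stored "title" field contains a keep-word. The field name and keep-word are made up for illustration; drop it next to PrintFieldsChecker inside PruneIndexTool so it can reuse the same imports:

  /**
   * Hypothetical sketch, not part of the attached tool: vetoes deletion
   * of documents whose stored "title" field contains a given keep-word.
   */
  public static class KeepTitleChecker implements PruneChecker {
    private String keepWord;

    public KeepTitleChecker(String keepWord) {
      this.keepWord = keepWord;
    }

    public boolean isPrunable(IndexReader reader, int docNum) throws Exception {
      Document doc = reader.document(docNum);
      String title = doc.get("title");
      // prune only if the stored title does NOT contain the keep-word
      return title == null || title.indexOf(keepWord) == -1;
    }
  }

Note that main() only knows how to construct PrintFieldsChecker (via -showfields), so using a custom checker currently means calling the PruneIndexTool constructor from your own code.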