Nutchers,

I know that I have seen many posts to this list regarding the usage of
nutch's prune tool (org.apache.nutch.tools.PruneIndexTool) and that many of
those posts noted the difficulty of having to pass Lucene queries as
parameters (for those of us who don't already have a firm understanding of
Lucene queries). I also know that many users wish that there was a pruning
tool that could prune an index based on regex commands (like those used by
regex-urlfilter.txt).

A while back a friend of mine who is more Java-savvy than I am (I am not
Java-savvy at all) wrote such a tool, which we called PruneRegexTool. The
reason for this post is twofold: first, to share this tool with the nutch
community and second, to ask for help in updating this tool to work with
nutch 0.8.1, since it currently only works with 0.7.1. I'm not exactly sure
why it doesn't work with 0.8.1, but I can think of at least two probable
reasons: this tool uses Lucene 1.4.3, which is now outdated, and the tool is
based on the PruneIndexTool, which probably changed between nutch 0.7.1 and
nutch 0.8.1.

As I have already noted, my understanding of java is quite poor, so I don't
know if updating this tool for use with nutch 0.8.1 is a 20 minute or 20
hour task (I'm guessing somewhere in between). Anyway, I have included the
.java and .class file for this tool (as attachments), as well as the
instructions that my friend provided me regarding how to use it.

If anyone updates it for use with nutch 0.8.1, I would be eternally grateful
to get a copy of the updated code, as well as any changes in the
instructions for how to use this tool properly.

Final Disclaimer: I cannot make any assurance about this tool and the
instructions provided, except that it worked fine for me when I was still
using nutch 0.7.1.

Instructions:

*Stage 1:* Compiling PruneRegexTool

Everything in this stage except for the very last step (running javac on
PruneRegexTool.java) only needs to be done once. The last step only needs to
be done again if changes are made to PruneRegexTool.java.

If you haven't already, make sure that the java compiler is in your PATH. If
not, you can add it by putting the following in your ~/.bash_profile:

  - PATH=$PATH:/usr/local/j2sdk1.4.2_08/bin/
  - export PATH

Then run:

  - source ~/.bash_profile

to make the change take effect for your current session. (The ~ is an alias
for your home directory, so you can run this command from anywhere.)
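
If you aren't sure whether the compiler is already in your PATH, a quick check along these lines may help. This is only a sketch: have_cmd is a helper name made up for this example, and it relies on nothing more than the shell's standard "command -v" builtin.

```shell
# have_cmd is a hypothetical helper: succeeds if the named command is on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

if have_cmd javac; then
  echo "javac found at: $(command -v javac)"
else
  echo "javac not found; add your JDK's bin directory to PATH" >&2
fi
```

If it reports that javac was not found, the PATH line in your ~/.bash_profile probably points at the wrong directory for your JDK install.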

PruneRegexTool uses code from Nutch and two other libraries that are not
included with the Nutch package: Lucene and ORO. Download the source code
for these libraries into your nutch-0.7.1 directory with the commands (run
from within the nutch directory):

  - wget http://apache.mirrors.pair.com/jakarta/lucene/source/lucene-1.4.3-src.tar.gz
  - wget http://apache.mirrors.versehost.com/jakarta/oro/source/jakarta-oro-2.0.8.tar.gz

And then extract them with:

  - tar xzvf lucene-1.4.3-src.tar.gz
  - tar xzvf jakarta-oro-2.0.8.tar.gz

Now that we've got these libraries set up, we need to tell the java compiler
where they (and Nutch's own source code) live. This is done by setting the
CLASSPATH environment variable. Add the following to your ~/.bash_profile
(the CLASSPATH value must be a single line with no spaces), replacing
USERNAME with your username:

CLASSPATH=/home/USERNAME/nutch-0.7.1/jakarta-oro-2.0.8/src/java:/home/USERNAME/nutch-0.7.1/lucene-1.4.3/src/java:/home/USERNAME/nutch-0.7.1/src/java:/home/USERNAME/nutch-0.7.1/src/plugin/urlfilter-regex/src/java:.

export CLASSPATH
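
Since a long CLASSPATH line is easy to mangle when it gets wrapped by an email client, here is an equivalent way to build it up one entry at a time. This is a sketch only: NUTCH_HOME is a variable name invented for this example, and it assumes your nutch-0.7.1 directory sits directly in your home directory.

```shell
# NUTCH_HOME is a made-up helper variable; point it at your own nutch-0.7.1 directory.
NUTCH_HOME="$HOME/nutch-0.7.1"

# Build the same four source directories plus "." one entry at a time.
CLASSPATH="$NUTCH_HOME/jakarta-oro-2.0.8/src/java"
CLASSPATH="$CLASSPATH:$NUTCH_HOME/lucene-1.4.3/src/java"
CLASSPATH="$CLASSPATH:$NUTCH_HOME/src/java"
CLASSPATH="$CLASSPATH:$NUTCH_HOME/src/plugin/urlfilter-regex/src/java"
CLASSPATH="$CLASSPATH:."
export CLASSPATH

# Print one entry per line so a typo or stray space is easy to spot.
printf '%s\n' "$CLASSPATH" | tr ':' '\n'
```

The last command is just a sanity check; each printed line should be a directory that actually exists (apart from the final ".").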

Next, copy PruneRegexTool.java and PruneRegexTool.class to your nutch
directory.

You should then be able to compile the tool with the command:

javac PruneRegexTool.java

*Stage 2:* Set Up conf/regex-prune.txt

The file conf/regex-prune.txt is the default file that PruneRegexTool reads
to determine which pages to keep and which to discard based on their URLs.
It uses the same format as conf/regex-urlfilter.txt, with + indicating that
a URL will be pruned, and - indicating that it will not be pruned. A custom
file can be specified when running the tool with the flag -regexfile
filename.
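
To make the +/- convention concrete, here is a rough illustration. It writes a tiny sample rules file and then mimics how I would expect the rules to be read. Please treat all of it as assumption rather than a description of the tool's actual code: the example.com patterns are invented, would_prune is a helper made up for this example, the "first matching rule wins" order is borrowed from how regex-urlfilter.txt behaves, treating an unmatched URL as kept is purely a guess, and grep -E is only a stand-in for the ORO regex library the tool really uses.

```shell
# Write a small sample rules file (hypothetical patterns, for illustration only).
rules_file="${TMPDIR:-/tmp}/regex-prune-sample.txt"
cat > "$rules_file" <<'EOF'
# Lines starting with '#' are comments.
# Do NOT prune anything under /docs/ ...
-^http://www\.example\.com/docs/
# ... but DO prune everything else on that host.
+^http://www\.example\.com/
EOF

# would_prune is a hypothetical helper, not part of PruneRegexTool.
# Assumed semantics: the first rule whose regex matches the URL decides.
# Prints "prune" for a '+' rule, "keep" for a '-' rule or when nothing matches.
would_prune() {
  url=$1
  while IFS= read -r line; do
    case $line in
      ''|'#'*) continue ;;          # skip blank lines and comments
    esac
    sign=${line%"${line#?}"}        # first character: '+' or '-'
    regex=${line#?}                 # everything after the sign
    if printf '%s\n' "$url" | grep -Eq -- "$regex"; then
      if [ "$sign" = "+" ]; then echo prune; else echo keep; fi
      return 0
    fi
  done < "$rules_file"
  echo keep                         # guess: unmatched URLs are left alone
}

would_prune "http://www.example.com/docs/index.html"
would_prune "http://www.example.com/old/page.html"
```

Under these assumptions the order of the rules matters: the more specific "-" line has to come before the broader "+" line, otherwise the /docs/ pages would be pruned too. The real tool's -dryrun flag is the safe way to confirm what your own rules actually do.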

*Stage 3:* Run The Tool

The full set of command line arguments that can be given to the tool is as
follows:

  - PruneRegexTool <indexDir | segmentsDir> [-regexfile filename]
  [-dryrun] [-force] [-output filename]
  - NOTE: exactly one of <indexDir> or <segmentsDir> must be provided
  - -*regexfile* specify the file containing the regex to be used
  during pruning. defaults to conf/regex-prune.txt
  - -*dryrun* don't do anything, just show what would be done.
  - -*force* force index unlock, if locked. Use with caution!
  - -*output* store pruned URLs in a text file


An example run of the tool (from your nutch-0.7.1 directory):

  - bin/nutch PruneRegexTool crawl/segments -dryrun -output prunedurls.log

End of Instructions

Again, if anyone who uses nutch 0.8.1 thinks this tool would be useful, I
would love for you to update it so it works with nutch 0.8.1. For those of
you using nutch 0.7.1, I hope you find it useful in its current form.

-Bryan
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
