I added this statement right after

  # CLASSPATH initially contains $NUTCH_CONF_DIR, or defaults to $NUTCH_HOME/conf
  CLASSPATH=${NUTCH_CONF_DIR:=$NUTCH_HOME/conf}

Then, if you run nutch org.apache.nutch.net.RegexURLFilter, you should be
able to test your URLs.
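
To make the placement concrete, here is a minimal sketch of how the two
lines sit together. It is not the actual bin/nutch script: the /opt/nutch
fallback is a placeholder I made up, and the jar path is the one quoted
further down in this thread.

```shell
#!/bin/sh
# Sketch only: /opt/nutch is a placeholder, not a real install location.
NUTCH_HOME=${NUTCH_HOME:-/opt/nutch}

# CLASSPATH initially contains $NUTCH_CONF_DIR, or defaults to $NUTCH_HOME/conf.
# This plain assignment discards any CLASSPATH inherited from the environment.
CLASSPATH=${NUTCH_CONF_DIR:=$NUTCH_HOME/conf}

# Added line: append the regex URL filter plugin jar.
CLASSPATH=${CLASSPATH}:$NUTCH_HOME/plugins/urlfilter-regex/urlfilter-regex.jar

echo "$CLASSPATH"
```

Because ${NUTCH_CONF_DIR:=...} also assigns the default back to
NUTCH_CONF_DIR when it is unset or empty, the variable is usable later in
the script either way.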

I am still not completely satisfied with this answer though. The nutch
script contains the following statement:

  # add plugins to classpath
  if [ -d "$NUTCH_HOME/plugins" ]; then
    CLASSPATH=${CLASSPATH}:$NUTCH_HOME
  fi


What is this statement supposed to accomplish? Shouldn't this read something
like

  # add plugins to classpath
  if [ -d "$NUTCH_HOME/plugins" ]; then
    CLASSPATH=${CLASSPATH}:$NUTCH_HOME/plugins/**/*.jar
  fi

As it stands, the plugin jars are not on the classpath, not even after
running ant compile or ant jar, which copies the plugins to
$NUTCH_HOME/build/plugins.
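
One caveat on the line suggested above: the shell does not expand a glob on
the right-hand side of a variable assignment, so that line would put the
literal pattern on the classpath. A loop is one way around that. This is a
rough sketch, not what the nutch script does: the function name
add_plugin_jars is made up, and it assumes each plugin's jars sit one level
below the plugins directory, as in the urlfilter-regex path quoted in this
thread.

```shell
#!/bin/sh
# Hypothetical helper: append every jar in <plugins_dir>/<plugin>/ to CLASSPATH.
add_plugin_jars() {
  plugins_dir=$1
  # Nothing to do if the directory does not exist.
  [ -d "$plugins_dir" ] || return 0
  for jar in "$plugins_dir"/*/*.jar; do
    # When nothing matches, the pattern stays literal; skip it.
    [ -f "$jar" ] || continue
    # Add a ':' separator only if CLASSPATH is already non-empty.
    CLASSPATH=${CLASSPATH:+$CLASSPATH:}$jar
  done
}

# Usage, assuming NUTCH_HOME points at the Nutch directory:
add_plugin_jars "${NUTCH_HOME:-.}/plugins"
```

Inside the for loop the pattern is expanded, so each jar ends up as its own
colon-separated classpath entry.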

Rgrds, Thomas



On 12/1/05, Bryan Woliner <[EMAIL PROTECTED]> wrote:
>
> Sorry if the answer to this question should be obvious, but where in
> the bin/nutch script do you need to add the following line to be able
> to test your regex-urlfilter.txt file from the command line?
>
> CLASSPATH=${CLASSPATH}:$NUTCH_HOME/plugins/urlfilter-regex/urlfilter-regex.jar
>
>
>
> On 11/29/05, Thomas Delnoij <[EMAIL PROTECTED]> wrote:
> > For the sake of the archives, I will answer my own question here: I had to
> > add the following line to the bin/nutch script to be able to run
> > org.apache.nutch.net.RegexURLFilter from the command line:
> >
> > CLASSPATH=${CLASSPATH}:$NUTCH_HOME/plugins/urlfilter-regex/urlfilter-regex.jar
> >
> > The nutch script overrides the classpath environment variable, so adding the
> > jar there didn't help.
> >
> > Rgrds, Thomas Delnoij
> >
> >
> > On 10/5/05, Thomas Delnoij <[EMAIL PROTECTED]> wrote:
> > >
> > > All.
> > >
> > > The problem is actually a bit different. I was in a bit of a hurry when
> > > I posted the previous message, apologies.
> > >
> > > I added both urlfilter-regex.jar and nutch-0.7.1.jar to my classpath.
> > >
> > > When I run java org.apache.nutch.net.RegexURLFilter, I am getting
> > >
> > > 051005 221040 parsing jar:file:/C:/Personal/vvdb/Nutch/nutch-0.7.1/nutch-0.7.1.jar!/nutch-default.xml
> > > 051005 221040 parsing jar:file:/C:/Personal/vvdb/Nutch/nutch-0.7.1/nutch-0.7.1.jar!/nutch-site.xml
> > > 051005 221040 Plugins: directory not found: plugins
> > > Exception in thread "main" java.lang.ExceptionInInitializerError
> > > Caused by: java.lang.NullPointerException
> > >         at org.apache.nutch.net.RegexURLFilter.<clinit>(RegexURLFilter.java:64)
> > >
> > > When I run nutch org.apache.nutch.net.RegexURLFilter, I am getting
> > >
> > > Exception in thread "main" java.lang.NoClassDefFoundError:
> > > org/apache/nutch/net/RegexURLFilter
> > >
> > > I know I am missing something obvious, but your help is really
> > > appreciated.
> > >
> > > Kind regards, Thomas Delnoij
> > >
> > >
> > > On 10/5/05, Thomas Delnoij <[EMAIL PROTECTED]> wrote:
> > > >
> > > > I was a bit in a hurry when I posted this message, apologies.
> > > >
> > > > The problem is actually a bit different.
> > > >
> > > > I added both urlfilter-regex.jar and nutch-0.7.1.jar to my classpath.
> > > >
> > > > When I run java org.apache.nutch.net.RegexURLFilter,
> > > >
> > > > On 10/5/05, Thomas Delnoij < [EMAIL PROTECTED]> wrote:
> > > > >
> > > > > All.
> > > > >
> > > > > I want to run the RegexURLFilter's main() method for testing the
> > > > > regex-urlfilter.txt.
> > > > >
> > > > > I set up NUTCH_HOME and NUTCH_CONF_DIR so I think I set up my
> > > > > environment correctly.
> > > > >
> > > > > When I run nutch org.apache.nutch.net.RegexURLFilter I get Exception
> > > > > in thread "main" java.lang.NoClassDefFoundError:
> > > > > org/apache/nutch/net/RegexURLFilter.
> > > > >
> > > > > Assuming this was a classpath issue, I added
> > > > > NUTCH_HOME/plugins/urlfilter-regex/urlfilter-regex.jar to my
> > > > > classpath.
> > > > >
> > > > > This did not solve the problem, as I am still getting the
> > > > > NoClassDefFoundError.
> > > > >
> > > > > So my first question is how to set up my environment correctly for
> > > > > testing the regex-urlfilter.
> > > > >
> > > > > Secondly, I want to tune my regex-urlfilter for maximum relevancy of
> > > > > the crawl result. By now, I have around 50 entries. My second question
> > > > > is whether I can expect any performance impact.
> > > > >
> > > > > Your help is greatly appreciated.
> > > > >
> > > > > Kind regards, Thomas Delnoij.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
> >
>
