Hi Matt,
Thanks for your advice.
I can now run URLFilterChecker successfully; however, I
get the following error, which complains about the IndexingFilter extension point.
Could you tell me where the problem might be?
"
050921 191015 impl: point=org.apache.nutch.net.URLFilter class=org.apache.nutch.net.RegexURLFilter
050921 191015 not including: E:\programs\cygwin\home\fji\versionControl\nutch_V07_P87\nutch\build\plugins\WhitelistURLFilter
050921 191015 SEVERE org.apache.nutch.plugin.PluginRuntimeException: extension point: org.apache.nutch.indexer.IndexingFilter does not exist.
Exception in thread "main" java.lang.ExceptionInInitializerError
    at org.apache.nutch.net.URLFilterChecker.checkAll(URLFilterChecker.java:93)
    at org.apache.nutch.net.URLFilterChecker.main(URLFilterChecker.java:126)
Caused by: java.lang.RuntimeException: org.apache.nutch.plugin.PluginRuntimeException: extension point: org.apache.nutch.indexer.IndexingFilter does not exist.
    at org.apache.nutch.plugin.PluginRepository.getInstance(PluginRepository.java:147)
    at org.apache.nutch.net.URLFilters.<clinit>(URLFilters.java:40)
    ... 2 more
Caused by: org.apache.nutch.plugin.PluginRuntimeException: extension point: org.apache.nutch.indexer.IndexingFilter does not exist.
    at org.apache.nutch.plugin.PluginRepository.installExtensions(PluginRepository.java:78)
    at org.apache.nutch.plugin.PluginRepository.<init>(PluginRepository.java:61)
    at org.apache.nutch.plugin.PluginRepository.getInstance(PluginRepository.java:144)
    ... 3 more
"
thanks,
Michael Ji
--- Matt Kangas <[EMAIL PROTECTED]> wrote:
> Hi Michael,
>
> Ordinarily there's no need to edit bin/nutch to run a specific class.
> If the class is in a JAR in <nutch-home>/lib, you can just say
> "nutch <full class name>". For example, the following two commands
> are equivalent:
>
> $ nutch crawl
> $ nutch org.apache.nutch.tools.CrawlTool
>
> However, the situation is a little different for plugins. Ordinarily
> the classes for a plugin are placed in <nutch-home>/plugins/<plugin-name>,
> not <nutch-home>/lib. To instantiate the plugin class, you must run
> *another* class which calls the appropriate plugin factory. For
> URLFilter plugins, the factory class is org.apache.nutch.net.URLFilters.
> This class does not have a main() method, but there is a helper class
> to test filters, URLFilterChecker. You can run it as follows:
>
> $ nutch org.apache.nutch.net.URLFilterChecker -allCombined < urls.txt
>
> Hope that helps. Let me know if that doesn't work for you.
>
> --Matt
>
> On Sep 11, 2005, at 3:20 PM, Michael Ji wrote:
>
> > hi Matt:
> >
> > I implemented and compiled your patch in Nutch 0.7
> > successfully.
> >
> > However, I ran into a runtime problem when I tried to test the
> > patch manually by calling its class.
> >
> > I edited bin/nutch and added the lines
> > "
> > elif [ "$COMMAND" = WhitelistFilterTester ] ; then
> > CLASS=epile.crawl.plugin.WhitelistURLFilter
> > "
> >
> > But when I call it, I get this error:
> > "
> > Exception in thread "main" java.lang.NoClassDefFoundError: epile/crawl/plugin/WhitelistURLFilter
> > "
> >
> > I guess the classpath is not defined properly.
> >
> > My environment is set up as follows:
> >
> > 1. In the nutch build.xml, I added
> > "<ant dir="epile" target="deploy"/>"
> >
> > 2. In nutch/src/plugin/, I created the dir "epile-basic/src/java",
> > then copied the unzipped nutch-87 epile/crawl.. into that dir
> >
> > 3. I created plugin.xml in epile-basic/, the same as the one you
> > uploaded with the patch, and a new build.xml:
> > "
> > <?xml version="1.0"?>
> >
> > <project name="WhitelistURLFilter" default="jar">
> >
> > <import file="../build-plugin.xml"/>
> >
> > </project>
> >
> > "
> >
> > 4. In nutch, I can run "ant" successfully;
> > in nutch/build/, a new WhitelistURLFilter/ is created
> > with WhitelistURLFilter.class inside.
> >
> > Did I miss something important?
> >
> > thanks,
> >
> > Michael Ji
> >
> >
> > =====================================================
> > --- Matt Kangas <[EMAIL PROTECTED]> wrote:
> >
> >
> >> Hi Michael,
> >>
> >> Only WhitelistURLFilter is a plugin class. WhitelistWriter is a
> >> utility for creating the on-disk hash used at fetch/inject time by
> >> WhitelistURLFilter. Sorry for the confusion. I will add a sample
> >> plugin.xml file to the ticket, which should help make things clearer.
> >>
> >> Also, "epile.util.*" are our proprietary classes. LogLevel simply
> >> retrieves a value from a file other than nutch-site.xml. You can
> >> safely replace the references to epile.util.LogLevel with:
> >>
> >>
> >>> import org.apache.nutch.util.LogFormatter;
> >>> private static final Logger LOG =
> >>>     LogFormatter.getLogger(WhitelistURLFilter.class.getName());
> >>
> >> StringURL is another utility class, probably not of high value. It
> >> just applies regexes to URL strings. The only references to it that I
> >> see are:
> >>
> >>
> >>> $ grep StringURL WhitelistURLFilter.java
> >>> import epile.crawl.util.StringURL;
> >>> String hostname = StringURL.extractHostname(url);
> >>> String strippedURL = StringURL.removeHostname(url);
> >>> String domain = StringURL.extractDomainFromHostname(hostname);
> >>> if (StringURL.isCGI(url))
> >>
> >> extractHostname() and removeHostname() can be replaced with calls to
> >> java.net.URL.getHost() and getPath(), respectively. The other two are
> >> simple to replicate, and can probably be commented out for basic use.
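
The replacement suggested above could look roughly like the sketch below. It is only an illustration: the class name HostPathDemo and the example URL are made up, and the two method names merely mirror the StringURL calls from the grep output. The original StringURL source is not shown in this thread, so its exact behavior (for instance, whether removeHostname() kept the query string) is an assumption.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical stand-ins for the proprietary StringURL helpers,
// built only on java.net.URL as suggested in the thread.
public class HostPathDemo {

    // Stand-in for StringURL.extractHostname(url): returns the host part.
    static String extractHostname(String url) {
        try {
            return new URL(url).getHost();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + url, e);
        }
    }

    // Stand-in for StringURL.removeHostname(url): returns the path part.
    // Note: getPath() drops the query string; use getFile() to keep it.
    static String removeHostname(String url) {
        try {
            return new URL(url).getPath();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + url, e);
        }
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/docs/index.html?lang=en";
        System.out.println(extractHostname(url)); // www.example.com
        System.out.println(removeHostname(url));  // /docs/index.html
    }
}
```

Inside WhitelistURLFilter the same two calls would simply replace the StringURL.extractHostname(url) and StringURL.removeHostname(url) lines, with the MalformedURLException handled however the filter prefers.
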
> >>
> >> Finally, to use this "new" plugin, you need to:
> >>
> >> a) make sure a suitable directory is created under "plugins",
> >> including a plugin.xml and a jar with the WhitelistURLFilter class
> >>
> >> b) modify your nutch-site.xml to include the new filter:
> >>
> >>
> >>> <property>
> >>> <name>epile.crawl.whitelist.enableUndirectedCrawl</name>
> >>> <value>false</value>
> >>> </property>
> >>>
> >>> <property>
> >>> <name>urlfilter.whitelist.file</name>
> >>> <value>/var/epile/crawl/whitelist_map</value>
> >>> <description>Name of file containing the location of the on-disk
> >>> whitelist map directory.</description>
> >>> </property>
> >>>
> >>> <property>
>
=== message truncated ===