Hi Matt,

Thanks for your advice.

I can now launch URLFilterChecker, but it fails with
the following error, which complains about the
IndexingFilter extension point. Could you let me know
where the problem might be?

"
050921 191015 impl: point=org.apache.nutch.net.URLFilter class=org.apache.nutch.net.RegexURLFilter
050921 191015 not including: E:\programs\cygwin\home\fji\versionControl\nutch_V07_P87\nutch\build\plugins\WhitelistURLFilter
050921 191015 SEVERE org.apache.nutch.plugin.PluginRuntimeException: extension point: org.apache.nutch.indexer.IndexingFilter does not exist.
Exception in thread "main" java.lang.ExceptionInInitializerError
        at org.apache.nutch.net.URLFilterChecker.checkAll(URLFilterChecker.java:93)
        at org.apache.nutch.net.URLFilterChecker.main(URLFilterChecker.java:126)
Caused by: java.lang.RuntimeException: org.apache.nutch.plugin.PluginRuntimeException: extension point: org.apache.nutch.indexer.IndexingFilter does not exist.
        at org.apache.nutch.plugin.PluginRepository.getInstance(PluginRepository.java:147)
        at org.apache.nutch.net.URLFilters.<clinit>(URLFilters.java:40)
        ... 2 more
Caused by: org.apache.nutch.plugin.PluginRuntimeException: extension point: org.apache.nutch.indexer.IndexingFilter does not exist.
        at org.apache.nutch.plugin.PluginRepository.installExtensions(PluginRepository.java:78)
        at org.apache.nutch.plugin.PluginRepository.<init>(PluginRepository.java:61)
        at org.apache.nutch.plugin.PluginRepository.getInstance(PluginRepository.java:144)
        ... 3 more
"

thanks,

Michael Ji


--- Matt Kangas <[EMAIL PROTECTED]> wrote:

> Hi Michael,
> 
> Ordinarily there's no need to edit bin/nutch to run
> a specific class. If the class is in a JAR in
> <nutch-home>/lib, you can just say
> "nutch <full class name>". For example, the
> following two commands are equivalent:
> 
> $ nutch crawl
> $ nutch org.apache.nutch.tools.CrawlTool
> 
> However, the situation is a little different for
> plugins. Ordinarily the classes for a plugin are
> placed in <nutch-home>/plugins/<plugin-name>, not
> <nutch-home>/lib. To instantiate the plugin class,
> you must run *another* class which calls the
> appropriate plugin factory. For URLFilter plugins,
> the factory class is org.apache.nutch.net.URLFilters.
> This class does not have a main() method, but there
> is a helper class to test filters, URLFilterChecker.
> You can run it as follows:
> 
> $ nutch org.apache.nutch.net.URLFilterChecker -allCombined < urls.txt
> 
> Hope that helps. Let me know if that doesn't work
> for you.
> 
> --Matt
> 
> On Sep 11, 2005, at 3:20 PM, Michael Ji wrote:
> 
> > hi Matt:
> >
> > I implemented and compiled your patch in Nutch 0.7
> > successfully.
> >
> > However, I ran into a runtime problem when I tried
> > to test the patch manually by calling its class.
> >
> > I edited bin/nutch and added these lines:
> > "
> > elif [ "$COMMAND" = WhitelistFilterTester ] ; then
> >   CLASS=epile.crawl.plugin.WhitelistURLFilter
> > "
> >
> > But when I call it, I get this error:
> > "
> > Exception in thread "main"
> > java.lang.NoClassDefFoundError:
> > epile/crawl/plugin/WhitelistURLFilter
> > "
> >
> > I guess the classpath is not defined properly.
> >
> > My environment is set up as follows:
> >
> > 1. In the nutch build.xml, I added
> > "<ant dir="epile" target="deploy"/> "
> >
> > 2. In nutch/src/plugin/, I created the dir
> > "epile-basic/src/java", then copied epile/crawl..
> > from the unzipped nutch-87 to that dir
> >
> > 3. I created plugin.xml in epile-basic/, the same
> > as the one you provided with the patch,
> > and a new build.xml of
> > "
> > <?xml version="1.0"?>
> >
> > <project name="WhitelistURLFilter" default="jar">
> >
> >   <import file="../build-plugin.xml"/>
> >
> > </project>
> >
> > "
> >
> > 4. In nutch, I can run "ant" successfully;
> > in nutch/build/, a new WhitelistURLFilter/ is created
> > with WhitelistURLFilter.class inside.
> >
> > Did I miss something important?
> >
> > thanks,
> >
> > Michael Ji
> >
> >
> > =====================================================
> > --- Matt Kangas <[EMAIL PROTECTED]> wrote:
> >
> >
> >> Hi Michael,
> >>
> >> Only WhitelistURLFilter is a plugin class.
> >> WhitelistWriter is a utility for creating the
> >> on-disk hash used at fetch/inject time by
> >> WhitelistURLFilter. Sorry for the confusion. I will
> >> add a sample plugin.xml file to the ticket, which
> >> should help make things clearer.
> >>
> >> Also, "epile.util.*" are our proprietary classes.
> >> LogLevel simply retrieves a value from a file other
> >> than nutch-site.xml. You can safely replace the
> >> references to epile.util.LogLevel with:
> >>
> >>
> >>> import org.apache.nutch.util.LogFormatter;
> >>> private static final Logger LOG =
> >>>     LogFormatter.getLogger(WhitelistURLFilter.class.getName());
> >>
> >> StringURL is another utility class, probably not of
> >> high value. It just applies regexes to URL strings.
> >> The only references to it that I see are:
> >>
> >>
> >>> $ grep StringURL WhitelistURLFilter.java
> >>> import epile.crawl.util.StringURL;
> >>>     String hostname = StringURL.extractHostname(url);
> >>>       String strippedURL = StringURL.removeHostname(url);
> >>>         String domain = StringURL.extractDomainFromHostname(hostname);
> >>>       if (StringURL.isCGI(url))
> >>
> >> extractHostname() and removeHostname() can be
> >> replaced with calls to java.net.URL.getHost() and
> >> getPath(), respectively. The other two are simple
> >> to replicate, and can probably be commented out
> >> for basic use.
> >>
> >> Finally, to use this "new" plugin, you need to:
> >>
> >> a) make sure a suitable directory is created under
> >> "plugins", including a plugin.xml and a jar with the
> >> WhitelistURLFilter class
> >>
> >> b) modify your nutch-site.xml to include the new
> >> filter:
> >>
> >>
> >>> <property>
> >>>   <name>epile.crawl.whitelist.enableUndirectedCrawl</name>
> >>>   <value>false</value>
> >>> </property>
> >>>
> >>> <property>
> >>>   <name>urlfilter.whitelist.file</name>
> >>>   <value>/var/epile/crawl/whitelist_map</value>
> >>>   <description>Name of file containing the location of the on-disk
> >>>   whitelist map directory.</description>
> >>> </property>
> >>>
> >>> <property>
> 
=== message truncated ===
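
For reference, the StringURL replacement suggested in the quoted
message (java.net.URL.getHost() and getPath() in place of
extractHostname() and removeHostname()) might look roughly like
the sketch below. The names HostPathSketch, hostOf, and pathOf are
made up for illustration, and the exact behavior of the original
StringURL methods is not shown in this thread, so the equivalence
is an assumption. Note that getPath() drops the query string.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper class; not part of Nutch or of the patch.
public class HostPathSketch {

    // Assumed stand-in for StringURL.extractHostname(url).
    static String hostOf(String url) {
        try {
            return new URL(url).getHost();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Bad URL: " + url, e);
        }
    }

    // Assumed stand-in for StringURL.removeHostname(url).
    // Returns only the path component, without the query string.
    static String pathOf(String url) {
        try {
            return new URL(url).getPath();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Bad URL: " + url, e);
        }
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/docs/index.html?lang=en";
        System.out.println(hostOf(url)); // prints www.example.com
        System.out.println(pathOf(url)); // prints /docs/index.html
    }
}
```

If the filter needs the query string as well, java.net.URL.getFile()
returns the path plus the query.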



                