Re: Nutch-87 Setup

Michael Ji Sun, 11 Sep 2005 12:20:49 -0700

hi Matt:

I implemented and compiled your patch in Nutch 0.7
successfully.

However, I ran into a runtime problem when I tried to test
the patch manually by calling its class directly.

I edited bin/nutch and added these lines:
"
elif [ "$COMMAND" = WhitelistFilterTester ] ; then
  CLASS=epile.crawl.plugin.WhitelistURLFilter
"

But when I call it, I get this error:
"
Exception in thread "main" java.lang.NoClassDefFoundError: epile/crawl/plugin/WhitelistURLFilter
"

I guess the classpath is not defined properly.
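
One way to check that guess is to ask the JVM directly whether the
class is visible on the classpath that bin/nutch assembles. The
following is only a minimal sketch and uses no Nutch code;
ClasspathCheck and isLoadable are hypothetical names made up for this
illustration.

```java
public class ClasspathCheck {
    /** Returns true if the named class is visible to this class's class loader. */
    public static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // LinkageError covers NoClassDefFoundError for missing dependencies
            return false;
        }
    }

    public static void main(String[] args) {
        // Default to the class named in the error message above
        String name = args.length > 0 ? args[0] : "epile.crawl.plugin.WhitelistURLFilter";
        System.out.println(name + " loadable: " + isLoadable(name));
        // Print the effective classpath so it can be compared with the build output
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));
    }
}
```

Run with the same CLASSPATH that bin/nutch builds, this shows whether
the class is reachable at all. As far as I understand, plugin jars
are normally loaded by Nutch's own plugin class loaders rather than
from the main classpath, so a class that exists only inside a plugin
jar may well be reported as not loadable here.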

My environment is set up as follows:

1. nutch build.xml
added "<ant dir="epile" target="deploy"/>"

2. nutch/src/plugin/
created the directory "epile-basic/src/java",
then unzipped the nutch-87 epile/crawl.. sources into that directory

3. I created plugin.xml in epile-basic/, the same as the
one you attached to the patch,
and a new build.xml:
"
<?xml version="1.0"?>

<project name="WhitelistURLFilter" default="jar">

  <import file="../build-plugin.xml"/>

</project>

"

4. In nutch, I can run "ant" successfully;
a new WhitelistURLFilter/ directory is created in nutch/build/,
with WhitelistURLFilter.class inside.

Did I miss something important?

thanks,

Michael Ji

=====================================================
--- Matt Kangas <[EMAIL PROTECTED]> wrote:

> Hi Michael,
> 
> Only WhitelistURLFilter is a plugin class. WhitelistWriter is a
> utility for creating the on-disk hash used at fetch/inject time by
> WhitelistURLFilter. Sorry for the confusion. I will add a sample
> plugin.xml file to the ticket, which should help make things clearer.
> 
> Also, "epile.util.*" are our proprietary classes. LogLevel simply
> retrieves a value from a file other than nutch-site.xml. You can
> safely replace the references to epile.util.LogLevel with:
> 
> > import org.apache.nutch.util.LogFormatter;
> > private static final Logger LOG =
> >     LogFormatter.getLogger(WhitelistURLFilter.class.getName());
> 
> StringURL is another utility class, probably not of high value. It
> just applies regexes to URL strings. The only references to it that I
> see are:
> 
> > $ grep StringURL WhitelistURLFilter.java
> > import epile.crawl.util.StringURL;
> >     String hostname = StringURL.extractHostname(url);
> >       String strippedURL = StringURL.removeHostname(url);
> >         String domain = StringURL.extractDomainFromHostname(hostname);
> >       if (StringURL.isCGI(url))
> 
> extractHostname() and removeHostname() can be replaced with calls to
> java.net.URL.getHost() and getPath(), respectively. The other two are
> simple to replicate, and can probably be commented out for basic use.
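
The replacement described above could look roughly like the sketch
below. UrlParts and all of its method names are hypothetical; the
StringURL source is not part of the patch, so looksLikeCgi and
lastTwoLabels are only guesses at what isCGI and
extractDomainFromHostname might do, not the original logic.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UrlParts {
    // Parse once and convert the checked exception so callers stay simple
    private static URL parse(String url) {
        try {
            return new URL(url);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + url, e);
        }
    }

    /** Stand-in for StringURL.extractHostname(url). */
    public static String hostOf(String url) {
        return parse(url).getHost();
    }

    /** Stand-in for StringURL.removeHostname(url); getPath() drops any query string. */
    public static String pathOf(String url) {
        return parse(url).getPath();
    }

    /** Alternative that keeps the query string, using URL.getFile(). */
    public static String fileOf(String url) {
        return parse(url).getFile();
    }

    /** Guess at isCGI: a query string or a "/cgi-bin/" path segment. */
    public static boolean looksLikeCgi(String url) {
        URL u = parse(url);
        return u.getQuery() != null || u.getPath().contains("/cgi-bin/");
    }

    /** Guess at extractDomainFromHostname: naive "last two labels" rule. */
    public static String lastTwoLabels(String hostname) {
        String[] labels = hostname.split("\\.");
        if (labels.length <= 2) {
            return hostname;
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/a/b.html?x=1";
        System.out.println(hostOf(url));
        System.out.println(pathOf(url));
        System.out.println(looksLikeCgi(url));
        System.out.println(lastTwoLabels(hostOf(url)));
    }
}
```

The "last two labels" rule is wrong for hosts under suffixes such as
co.uk, so it is only a placeholder for basic use.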
> 
> Finally, to use this "new" plugin, you need to:
> 
> a) make sure a suitable directory is created under "plugins",
> including a plugin.xml and a jar with the WhitelistURLFilter class
> 
> b) modify your nutch-site.xml to include the new filter:
> 
> > <property>
> >   <name>epile.crawl.whitelist.enableUndirectedCrawl</name>
> >   <value>false</value>
> > </property>
> >
> > <property>
> >   <name>urlfilter.whitelist.file</name>
> >   <value>/var/epile/crawl/whitelist_map</value>
> >   <description>Name of file containing the location of the on-disk
> > whitelist map directory.</description>
> > </property>
> >
> > <property>
> >   <name>plugin.includes</name>
> >   <value>epile-whitelisturlfilter|urlfilter-(prefix|regex)|parse-(text|html)|index-basic|query-(basic|site|url)</value>
> > </property>
> >
> > <property>
> >   <name>urlfilter.order</name>
> >   <value>org.apache.nutch.net.RegexURLFilter epile.crawl.plugin.WhitelistURLFilter</value>
> > </property>
> 
> c) run WhitelistWriter before attempting to fetch, so the filter has
> some rules to work with.
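
Regarding step b), the plugin.includes value is a regular expression,
so it may be worth checking which plugin ids it actually accepts. The
sketch below assumes that a plugin is enabled only when the id
declared in its plugin.xml matches the whole expression; that is my
understanding of the behavior, not something stated in this thread.
PluginIncludesCheck and included are hypothetical names.

```java
import java.util.regex.Pattern;

public class PluginIncludesCheck {
    // plugin.includes value from the nutch-site.xml snippet in step b), on one line
    static final String INCLUDES =
        "epile-whitelisturlfilter|urlfilter-(prefix|regex)|parse-(text|html)"
        + "|index-basic|query-(basic|site|url)";

    /** Assumption: a plugin id is enabled only if it matches the entire pattern. */
    public static boolean included(String pluginId) {
        return Pattern.matches(INCLUDES, pluginId);
    }

    public static void main(String[] args) {
        // "epile-basic" is the directory name used earlier in this thread;
        // whether it is also the plugin id depends on plugin.xml
        String[] ids = {"epile-whitelisturlfilter", "urlfilter-regex", "epile-basic"};
        for (String id : ids) {
            System.out.println(id + " included: " + included(id));
        }
    }
}
```

If the id in plugin.xml differs from "epile-whitelisturlfilter",
either the id or the plugin.includes value would need to change so
that the two agree.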
> 
> I may have left out a crucial step or two here (0.5 wink ;), so feel
> free to ask if anything seems unclear. I'll go update the ticket now
> to clarify these points.
> 
> --Matt
> 
> 
> On Sep 10, 2005, at 11:45 PM, Michael Ji wrote:
> 
> > hi Matt:
> >
> > Your nutch-87 is a good idea and I believe it provides
> > a solution for a good-sized controlled domain, say
> > hundreds of thousands of sites.
> >
> > I am currently trying to implement it in Nutch 0.7.
> >
> > I have several questions I would like clarified:
> >
> > 1)
> > Should I create two plug-in classes in nutch?
> >
> > i.e.
> > one for "WhitelistURLFilter"
> > one for "WhitelistWriter"
> >
> > 2)
> > I found that Whitelist.java refers to
> > "import epile.util.LogLevel;"
> >
> > And
> > WhitelistURLFilter.java refers to
> > "import epile.crawl.util.StringURL;
> > import epile.util.LogLevel;"
> >
> > Do these new packages exist in the Nutch lib? If not,
> > should we import a new epile*.jar?
> >
> > 3)
> > If we want to use Nutch-87, should we change the
> > Nutch core code?
> >
> > I plan to "replace" every occurrence of
> > RegexURLFilter with WhitelistURLFilter.
> >
> > Is that the right approach?
> >
> > thanks,
> >
> > Michael Ji,
> >
> 
> --
> Matt Kangas / [EMAIL PROTECTED]
> 
> 
> 

