Hi Michael,

Ordinarily there's no need to edit bin/nutch to run a specific class. If the class is in a JAR in <nutch-home>/lib, you can just say "nutch <full class name>". For example, the following two commands are equivalent:

$ nutch crawl
$ nutch org.apache.nutch.tools.CrawlTool

However, the situation is a little different for plugins. Ordinarily the classes for a plugin are placed in <nutch-home>/plugins/<plugin-name>, not <nutch-home>/lib. To instantiate the plugin class, you must use *another* class which calls the appropriate plugin factory. For URLFilter plugins, the factory class is org.apache.nutch.net.URLFilters. This class does not have a main() method, but there is a helper class to test filters, URLFilterChecker. You can run it as follows:

$ nutch org.apache.nutch.net.URLFilterChecker -allCombined < urls.txt
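
(For context, a URLFilter plugin class such as WhitelistURLFilter is simply an implementation of the org.apache.nutch.net.URLFilter extension-point interface, which is why it can't be run on its own. A minimal sketch, with a hypothetical class name and rule, assuming the Nutch 0.7 interface where returning the URL accepts it and returning null rejects it:

import org.apache.nutch.net.URLFilter;

public class ExampleWhitelistFilter implements URLFilter {
  // Accept only URLs mentioning example.com; reject everything else.
  public String filter(String urlString) {
    return (urlString.indexOf("example.com") >= 0) ? urlString : null;
  }
}
)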

Hope that helps. Let me know if that doesn't work for you.

--Matt

On Sep 11, 2005, at 3:20 PM, Michael Ji wrote:

hi Matt:

I implemented and compiled your patch in Nutch 0.7
successfully.

However, I ran into a runtime problem when I tried to
test the patch manually by calling its class.

I edited bin/nutch and added these lines:
"
elif [ "$COMMAND" = WhitelistFilterTester ] ; then
  CLASS=epile.crawl.plugin.WhitelistURLFilter
"

But when I call it, it gives me this error:
"
Exception in thread "main"
java.lang.NoClassDefFoundError: epile/crawl/plugin/WhitelistURLFilter
"

I guess the classpath is not defined properly.

My environment settings are as follows:

1. In nutch build.xml,
I added "<ant dir="epile" target="deploy"/>"

2. In nutch/src/plugin/,
I created the dir "epile-basic/src/java",
then copied the unzipped nutch-87 epile/crawl.. files into that dir

3. I created plugin.xml in epile-basic/, the same as the
one you included in the patch,
and a new build.xml of:
"
<?xml version="1.0"?>

<project name="WhitelistURLFilter" default="jar">

  <import file="../build-plugin.xml"/>

</project>

"

4. In nutch, I can run "ant" successfully;
in nutch/build/, a new WhitelistURLFilter/ dir is created
with WhitelistURLFilter.class inside.

Did I miss something important?

thanks,

Michael Ji

=====================================================
--- Matt Kangas <[EMAIL PROTECTED]> wrote:


Hi Michael,

Only WhitelistURLFilter is a plugin class.
WhitelistWriter is a
utility for creating the on-disk hash used at
fetch/inject time by
WhitelistURLFilter. Sorry for the confusion. I will
add a sample
plugin.xml file to the ticket, which should help
make things clearer.

Also, "epile.util.*" are our proprietary classes.
LogLevel simply
retrieves a value from a file other than
nutch-site.xml. You can
safely replace the references to epile.util.LogLevel
with:


import java.util.logging.Logger;
import org.apache.nutch.util.LogFormatter;

private static final Logger LOG =
  LogFormatter.getLogger(WhitelistURLFilter.class.getName());


StringURL is another utility class, probably not of
high value. It
just applies regexes to URL strings. The only
references to it that I
see are:


$ grep StringURL WhitelistURLFilter.java
import epile.crawl.util.StringURL;
    String hostname = StringURL.extractHostname(url);
      String strippedURL = StringURL.removeHostname(url);
        String domain = StringURL.extractDomainFromHostname(hostname);
      if (StringURL.isCGI(url))


extractHostname() and removeHostname() can be
replaced with calls to
java.net.URL.getHost() and getPath(), respectively.
The other two are
simple to replicate, and can probably be commented
out for basic use.
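
In code, the substitution is roughly this (a sketch only; the variable names follow the grep output above, and the proprietary StringURL helpers may differ in edge cases):

java.net.URL u = new java.net.URL(url);  // throws MalformedURLException on bad input
String hostname = u.getHost();           // replaces StringURL.extractHostname(url)
String strippedURL = u.getPath();        // replaces StringURL.removeHostname(url)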

Finally, to use this "new" plugin, you need to:

a) make sure a suitable directory is created under "plugins",
including a plugin.xml and a jar with the WhitelistURLFilter class
(see the plugin.xml sketch after these steps)

b) modify your nutch-site.xml to include the new filter:


<property>
  <name>epile.crawl.whitelist.enableUndirectedCrawl</name>
  <value>false</value>
</property>

<property>
  <name>urlfilter.whitelist.file</name>
  <value>/var/epile/crawl/whitelist_map</value>
  <description>Name of file containing the location of the on-disk
  whitelist map directory.</description>
</property>

<property>
  <name>plugin.includes</name>
  <value>epile-whitelisturlfilter|urlfilter-(prefix|regex)|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>

<property>
  <name>urlfilter.order</name>
  <value>org.apache.nutch.net.RegexURLFilter epile.crawl.plugin.WhitelistURLFilter</value>
</property>

c) run WhitelistWriter before attempting to fetch,
so the filter has
some rules to work with.
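
For (a), the plugin.xml would look roughly like the following (only a sketch; the plugin id, jar name, and version are placeholders, and the real file is the one I'm attaching to the ticket):

<?xml version="1.0" encoding="UTF-8"?>
<plugin
   id="epile-whitelisturlfilter"
   name="Whitelist URL Filter"
   version="1.0.0"
   provider-name="epile">

   <runtime>
      <!-- jar built from the plugin's src/java tree -->
      <library name="WhitelistURLFilter.jar">
         <export name="*"/>
      </library>
   </runtime>

   <extension id="epile.crawl.plugin.WhitelistURLFilter"
              name="Whitelist URL Filter"
              point="org.apache.nutch.net.URLFilter">
      <implementation id="WhitelistURLFilter"
                      class="epile.crawl.plugin.WhitelistURLFilter"/>
   </extension>
</plugin>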

I may have left out a crucial step or two here (0.5
wink ;), so feel
free to ask if anything seems unclear. I'll go
update the ticket now
to clarify these points.

--Matt


On Sep 10, 2005, at 11:45 PM, Michael Ji wrote:


hi Matt:

Your nutch-87 patch has a good idea, and I believe it
provides a solution for a good-sized set of controlled
domains, say hundreds of thousands of sites.

I am currently trying to implement it in Nutch 0.7.

I have several questions I'd like clarified:

1)
Should I create two plug-in classes in Nutch, i.e.,
one for "WhitelistURLFilter" and
one for "WhitelistWriter"?

2)
I found Whitelist.java refers to
"import epile.util.LogLevel;"

And WhitelistURLFilter.java refers to
"import epile.crawl.util.StringURL;
import epile.util.LogLevel;"

Do these new packages exist in the Nutch lib? If not,
should we import a new epile*.jar?

3)
If we want to use Nutch-87, should we change the
Nutch core code?

I plan to "replace" all the places where
RegexURLFilter appears with WhitelistURLFilter.

Is that the right approach?

thanks,

Michael Ji,



--
Matt Kangas / [EMAIL PROTECTED]








--
Matt Kangas / [EMAIL PROTECTED]

