Re: Nutch-87 Setup

Matt Kangas Sat, 10 Sep 2005 22:33:27 -0700

Hi Michael,

Only WhitelistURLFilter is a plugin class. WhitelistWriter is a utility for creating the on-disk hash used at fetch/inject time by WhitelistURLFilter. Sorry for the confusion. I will add a sample plugin.xml file to the ticket, which should help make things clearer.

Also, "epile.util.*" are our proprietary classes. LogLevel simply retrieves a value from a file other than nutch-site.xml. You can safely replace the references to epile.util.LogLevel with:

import java.util.logging.Logger;
import org.apache.nutch.util.LogFormatter;

private static final Logger LOG = LogFormatter.getLogger(WhitelistURLFilter.class.getName());

StringURL is another utility class, probably not of high value. It just applies regexes to URL strings. The only references to it that I see are:

$ grep StringURL WhitelistURLFilter.java
import epile.crawl.util.StringURL;
    String hostname = StringURL.extractHostname(url);
      String strippedURL = StringURL.removeHostname(url);
        String domain = StringURL.extractDomainFromHostname(hostname);
      if (StringURL.isCGI(url))

extractHostname() and removeHostname() can be replaced with calls to java.net.URL.getHost() and getPath(), respectively. The other two are simple to replicate, and can probably be commented out for basic use.
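
As a rough illustration of that replacement, here is a minimal sketch of a stand-in for StringURL. The class name StringUrlSketch is hypothetical. The first two methods use java.net.URL as described above; note that getPath() drops the query string, which may or may not match what the original removeHostname() returned. The last two methods are only guesses at what isCGI() and extractDomainFromHostname() do, since the original source is not shown here.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical stand-in for epile.crawl.util.StringURL (not the original code).
final class StringUrlSketch {

    // Host part of the URL, or null if the URL cannot be parsed.
    static String extractHostname(String url) {
        try {
            return new URL(url).getHost();
        } catch (MalformedURLException e) {
            return null;
        }
    }

    // Path part of the URL (without the query string), or null if unparseable.
    static String removeHostname(String url) {
        try {
            return new URL(url).getPath();
        } catch (MalformedURLException e) {
            return null;
        }
    }

    // ASSUMPTION: treats a URL as "CGI" if it has a query string
    // or a /cgi-bin/ path segment. The original rule is unknown.
    static boolean isCGI(String url) {
        return url.indexOf('?') >= 0 || url.contains("/cgi-bin/");
    }

    // ASSUMPTION: naive "last two labels" rule; this is wrong for
    // suffixes such as co.uk and is only meant for basic use.
    static String extractDomainFromHostname(String hostname) {
        String[] labels = hostname.split("\\.");
        if (labels.length <= 2) {
            return hostname;
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }
}
```

With this in place, the four StringURL calls in WhitelistURLFilter.java could point at StringUrlSketch instead, provided the null return for malformed URLs is handled by the caller.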


Finally, to use this "new" plugin, you need to:

a) make sure a suitable directory is created under "plugins", including a plugin.xml and a jar with the WhitelistURLFilter class


b) modify your nutch-site.xml to include the new filter:

<property>
  <name>epile.crawl.whitelist.enableUndirectedCrawl</name>
  <value>false</value>
</property>

<property>
  <name>urlfilter.whitelist.file</name>
  <value>/var/epile/crawl/whitelist_map</value>
  <description>Name of file containing the location of the on-disk whitelist map directory.</description>
</property>

<property>
  <name>plugin.includes</name>
  <value>epile-whitelisturlfilter|urlfilter-(prefix|regex)|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>

<property>
  <name>urlfilter.order</name>
  <value>org.apache.nutch.net.RegexURLFilter epile.crawl.plugin.WhitelistURLFilter</value>
</property>

c) run WhitelistWriter before attempting to fetch, so the filter has some rules to work with.
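
To make the effect of urlfilter.order in step (b) a little more concrete, here is a simplified sketch of how an ordered chain of URL filters behaves. It is not Nutch code: UrlFilterSketch and FilterChainSketch are hypothetical names, the contract (return the URL to accept, null to reject) is assumed to mirror Nutch's URLFilter, and the two toy filters merely stand in for RegexURLFilter and WhitelistURLFilter. The in-memory set replaces the on-disk whitelist map for illustration only.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.List;
import java.util.Set;

// Hypothetical filter contract: return the URL to accept it, null to reject it.
interface UrlFilterSketch {
    String filter(String url);
}

final class FilterChainSketch {

    // Runs the filters in list order and stops at the first rejection.
    static String applyInOrder(List<UrlFilterSketch> filters, String url) {
        String current = url;
        for (UrlFilterSketch f : filters) {
            if (current == null) {
                return null;
            }
            current = f.filter(current);
        }
        return current;
    }

    // Toy stand-in for a regex filter: accepts only http:// URLs.
    static UrlFilterSketch httpOnly() {
        return url -> url.startsWith("http://") ? url : null;
    }

    // Toy stand-in for a whitelist filter: accepts only listed hosts.
    static UrlFilterSketch hostWhitelist(Set<String> allowedHosts) {
        return url -> {
            try {
                String host = new URL(url).getHost();
                return allowedHosts.contains(host) ? url : null;
            } catch (MalformedURLException e) {
                return null;
            }
        };
    }
}
```

In this picture the whitelist filter is simply one more link in the chain after the regex filter, so enabling it is a configuration change and the existing filter stays where it is.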

I may have left out a crucial step or two here (0.5 wink ;), so feel free to ask if anything seems unclear. I'll go update the ticket now to clarify these points.


--Matt


On Sep 10, 2005, at 11:45 PM, Michael Ji wrote:

hi Matt:

Your Nutch-87 is a good idea, and I believe it provides
a solution for a good-sized controlled domain, say
hundreds of thousands of sites.

I am currently trying to implement it in Nutch 0.7.

I have several questions I would like clarified:

1)
Should I create two plug-in classes in nutch?

etc
one for "WhitelistURLFilter"
one for "WhitelistWriter"

2)
I found Whitelist.java refer to
"import epile.util.LogLevel;"

And
WhitelistURLFilter.java refer to
"import epile.crawl.util.StringURL;
import epile.util.LogLevel;"

Do these new packages already exist in the Nutch lib? If not,
should we import a new epile*.jar?

3)
If we want to use Nutch-87, should we change the
Nutch core code?

I plan to "replace" RegexURLFilter with WhitelistURLFilter
everywhere it appears.

Is that the right approach?

thanks,

Michael Ji,


--
Matt Kangas / [EMAIL PROTECTED]
