Hi Michael,

Only WhitelistURLFilter is a plugin class. WhitelistWriter is a utility for creating the on-disk hash used at fetch/inject time by WhitelistURLFilter. Sorry for the confusion. I will add a sample plugin.xml file to the ticket, which should help make things clearer.

Also, "epile.util.*" are our proprietary classes. LogLevel simply retrieves a value from a file other than nutch-site.xml. You can safely replace the references to epile.util.LogLevel with:

import java.util.logging.Logger;
import org.apache.nutch.util.LogFormatter;
private static final Logger LOG = LogFormatter.getLogger(WhitelistURLFilter.class.getName());

StringURL is another utility class, probably not of high value. It just applies regexes to URL strings. The only references to it that I see are:

$ grep StringURL WhitelistURLFilter.java
import epile.crawl.util.StringURL;
    String hostname = StringURL.extractHostname(url);
      String strippedURL = StringURL.removeHostname(url);
        String domain = StringURL.extractDomainFromHostname(hostname);
      if (StringURL.isCGI(url))

extractHostname() and removeHostname() can be replaced with calls to java.net.URL.getHost() and getPath(), respectively. The other two are simple to replicate, and can probably be commented out for basic use.
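
In case it helps, here is a rough sketch of what drop-in replacements for those four calls might look like. The class name StringURLSketch is made up for illustration, and this is not the original epile code. The first two methods just wrap java.net.URL as described above. The last two are guesses at the intended behaviour: "domain" is assumed to mean the last two labels of the hostname (which is wrong for suffixes like co.uk), and "CGI" is assumed to mean a query string or a /cgi-bin/ path segment.

```java
import java.net.MalformedURLException;
import java.net.URL;

/** Hypothetical stand-in for epile.crawl.util.StringURL; not the original class. */
public class StringURLSketch {

    /** Host part of the URL, or null if the URL cannot be parsed. */
    static String extractHostname(String url) {
        try {
            return new URL(url).getHost();
        } catch (MalformedURLException e) {
            return null;
        }
    }

    /** Path part of the URL. getPath() omits the query string; getFile() would keep it. */
    static String removeHostname(String url) {
        try {
            return new URL(url).getPath();
        } catch (MalformedURLException e) {
            return null;
        }
    }

    /** ASSUMPTION: "domain" means the last two dot-separated labels of the hostname. */
    static String extractDomainFromHostname(String hostname) {
        String[] labels = hostname.split("\\.");
        int n = labels.length;
        return n <= 2 ? hostname : labels[n - 2] + "." + labels[n - 1];
    }

    /** ASSUMPTION: a URL counts as CGI if it has a query string or a /cgi-bin/ segment. */
    static boolean isCGI(String url) {
        return url.indexOf('?') >= 0 || url.contains("/cgi-bin/");
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/docs/index.html?x=1";
        String hostname = extractHostname(url);
        System.out.println(hostname);
        System.out.println(removeHostname(url));
        System.out.println(extractDomainFromHostname(hostname));
        System.out.println(isCGI(url));
    }
}
```

If the filter needs the query string kept in the stripped URL, getFile() is probably the better substitute for removeHostname().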

Finally, to use this "new" plugin, you need to:

a) make sure a suitable directory is created under "plugins", including a plugin.xml and a jar with the WhitelistURLFilter class

b) modify your nutch-site.xml to include the new filter:

<property>
  <name>epile.crawl.whitelist.enableUndirectedCrawl</name>
  <value>false</value>
</property>

<property>
  <name>urlfilter.whitelist.file</name>
  <value>/var/epile/crawl/whitelist_map</value>
  <description>Name of file containing the location of the on-disk whitelist map directory.</description>
</property>

<property>
  <name>plugin.includes</name>
  <value>epile-whitelisturlfilter|urlfilter-(prefix|regex)|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>

<property>
  <name>urlfilter.order</name>
  <value>org.apache.nutch.net.RegexURLFilter epile.crawl.plugin.WhitelistURLFilter</value>
</property>

c) run WhitelistWriter before attempting to fetch, so the filter has some rules to work with.
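
To give a feel for how the pieces in (b) fit together, here is a toy sketch of ordered URL filtering, where each filter returns the URL to keep it or null to reject it. Everything in it is a stand-in: UrlFilter is a local interface and not the Nutch class, HostWhitelistFilter uses an in-memory set where the real filter reads the on-disk map, and matching on the exact hostname is an assumption.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class WhitelistFilterSketch {

    /** Local stand-in for a URL filter: return the URL to keep it, null to reject it. */
    interface UrlFilter {
        String filter(String url);
    }

    /** Hypothetical whitelist filter backed by an in-memory set, not the on-disk map. */
    static class HostWhitelistFilter implements UrlFilter {
        private final Set<String> allowedHosts;

        HostWhitelistFilter(Set<String> allowedHosts) {
            this.allowedHosts = allowedHosts;
        }

        public String filter(String url) {
            try {
                // ASSUMPTION: the whitelist is keyed by the exact lower-cased hostname.
                String host = new URL(url).getHost().toLowerCase();
                return allowedHosts.contains(host) ? url : null;
            } catch (MalformedURLException e) {
                return null; // unparseable URLs are rejected
            }
        }
    }

    /** Applies the filters in the given order; the first rejection ends the chain. */
    static String applyInOrder(String url, UrlFilter... filters) {
        for (UrlFilter f : filters) {
            if (url == null) {
                break;
            }
            url = f.filter(url);
        }
        return url;
    }

    public static void main(String[] args) {
        Set<String> hosts = new HashSet<>(Arrays.asList("www.example.com"));
        // Stand-in for a regex filter: reject anything with a query string.
        UrlFilter noQuery = u -> u.indexOf('?') >= 0 ? null : u;
        UrlFilter whitelist = new HostWhitelistFilter(hosts);

        System.out.println(applyInOrder("http://www.example.com/a.html", noQuery, whitelist));
        System.out.println(applyInOrder("http://other.example.org/a.html", noQuery, whitelist));
    }
}
```

The point is that the whitelist filter runs alongside the regex filter in the order given by urlfilter.order, so nothing in the core code has to be replaced.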

I may have left out a crucial step or two here (0.5 wink ;), so feel free to ask if anything seems unclear. I'll go update the ticket now to clarify these points.

--Matt


On Sep 10, 2005, at 11:45 PM, Michael Ji wrote:

hi Matt:

Your NUTCH-87 is a good idea, and I believe it provides
a solution for a good-sized controlled domain, say
hundreds of thousands of sites.

I am currently trying to implement it in Nutch 0.7.

I have several questions I would like clarified:

1)
Should I create two plug-in classes in Nutch?

i.e.
one for "WhitelistURLFilter"
one for "WhitelistWriter"

2)
I found that Whitelist.java refers to
"import epile.util.LogLevel;"

And
WhitelistURLFilter.java refers to
"import epile.crawl.util.StringURL;
import epile.util.LogLevel;"

Do these packages already exist in the Nutch lib? If not,
should we import a new epile*.jar?

3)
If we want to use NUTCH-87, should we change the
Nutch core code?

I plan to "replace" every occurrence of
RegexURLFilter with WhitelistURLFilter.

Is that the right approach?

thanks,

Michael Ji,


--
Matt Kangas / [EMAIL PROTECTED]

