Hi Matt,

I implemented and compiled your patch in Nutch 0.7 successfully.
However, I ran into a runtime problem when I tried to test the patch
manually by calling its class.

I edited bin/nutch and added the line:

    elif [ "$COMMAND" = WhitelistFilterTester ] ; then
      CLASS=epile.crawl.plugin.WhitelistURLFilter

But when I call it, I get this error:

    Exception in thread "main" java.lang.NoClassDefFoundError: epile/crawl/plugin/WhitelistURLFilter

I guess the classpath is not defined properly.

My environment is set up as follows:

1. In the nutch build.xml, I added:

       <ant dir="epile" target="deploy"/>

2. Under nutch/src/plugin/, I created the directory
   "epile-basic/src/java", then unzipped nutch-87 and copied its
   epile/crawl.. tree into that directory.

3. I created plugin.xml in epile-basic/, the same as the one you posted
   with the patch, and a new build.xml:

       <?xml version="1.0"?>
       <project name="WhitelistURLFilter" default="jar">
         <import file="../build-plugin.xml"/>
       </project>

4. In nutch, I can run "ant" successfully; in nutch/build/, a new
   WhitelistURLFilter/ directory is created with
   WhitelistURLFilter.class inside.

Did I miss something important?

thanks,

Michael Ji

=====================================================

--- Matt Kangas <[EMAIL PROTECTED]> wrote:

> Hi Michael,
>
> Only WhitelistURLFilter is a plugin class. WhitelistWriter is a
> utility for creating the on-disk hash used at fetch/inject time by
> WhitelistURLFilter. Sorry for the confusion. I will add a sample
> plugin.xml file to the ticket, which should help make things clearer.
>
> Also, "epile.util.*" are our proprietary classes. LogLevel simply
> retrieves a value from a file other than nutch-site.xml. You can
> safely replace the references to epile.util.LogLevel with:
>
> > import org.apache.nutch.util.LogFormatter;
> > private static final Logger LOG =
> >   LogFormatter.getLogger(WhitelistURLFilter.class.getName());
>
> StringURL is another utility class, probably not of high value. It
> just applies regexes to URL strings.
> The only references to it that I see are:
>
> > $ grep StringURL WhitelistURLFilter.java
> > import epile.crawl.util.StringURL;
> >     String hostname = StringURL.extractHostname(url);
> >     String strippedURL = StringURL.removeHostname(url);
> >     String domain = StringURL.extractDomainFromHostname(hostname);
> >     if (StringURL.isCGI(url))
>
> extractHostname() and removeHostname() can be replaced with calls to
> java.net.URL.getHost() and getPath(), respectively. The other two are
> simple to replicate, and can probably be commented out for basic use.
>
> Finally, to use this "new" plugin, you need to:
>
> a) make sure a suitable directory is created under "plugins",
>    including a plugin.xml and a jar with the WhitelistURLFilter class
>
> b) modify your nutch-site.xml to include the new filter:
>
> > <property>
> >   <name>epile.crawl.whitelist.enableUndirectedCrawl</name>
> >   <value>false</value>
> > </property>
> >
> > <property>
> >   <name>urlfilter.whitelist.file</name>
> >   <value>/var/epile/crawl/whitelist_map</value>
> >   <description>Name of file containing the location of the on-disk
> >   whitelist map directory.</description>
> > </property>
> >
> > <property>
> >   <name>plugin.includes</name>
> >   <value>epile-whitelisturlfilter|urlfilter-(prefix|regex)|parse-(text|html)|index-basic|query-(basic|site|url)</value>
> > </property>
> >
> > <property>
> >   <name>urlfilter.order</name>
> >   <value>org.apache.nutch.net.RegexURLFilter epile.crawl.plugin.WhitelistURLFilter</value>
> > </property>
>
> c) run WhitelistWriter before attempting to fetch, so the filter has
>    some rules to work with.
>
> I may have left out a crucial step or two here (0.5 wink ;), so feel
> free to ask if anything seems unclear. I'll go update the ticket now
> to clarify these points.
>
> --Matt
>
> On Sep 10, 2005, at 11:45 PM, Michael Ji wrote:
>
> > hi Matt:
> >
> > Your nutch-87 has a good idea, and I believe it provides
> > a solution for a good-sized controlled domain, say
> > hundreds of thousands of sites.
> >
> > I am currently trying to implement it in Nutch 0.7.
> >
> > I have several questions I would like clarified:
> >
> > 1)
> > Should I create two plug-in classes in nutch?
> >
> > i.e.
> > one for "WhitelistURLFilter"
> > one for "WhitelistWriter"
> >
> > 2)
> > I found that Whitelist.java refers to
> > "import epile.util.LogLevel;"
> >
> > And
> > WhitelistURLFilter.java refers to
> > "import epile.crawl.util.StringURL;
> > import epile.util.LogLevel;"
> >
> > Do these new packages exist in the Nutch lib? If not,
> > should we import a new epile*.jar?
> >
> > 3)
> > If we want to use Nutch-87, should we change the
> > Nutch core code?
> >
> > I plan to "replace" all the places where
> > RegexURLFilter appears with WhitelistURLFilter.
> >
> > Is that the right approach?
> >
> > thanks,
> >
> > Michael Ji,
>
> --
> Matt Kangas / [EMAIL PROTECTED]
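
For reference, the java.net.URL substitution that Matt describes for StringURL.extractHostname() and StringURL.removeHostname() could be sketched roughly as below. The class name UrlParts and the method names hostOf and pathOf are hypothetical and are not part of the nutch-87 patch. The thread does not show how the original StringURL methods behave, so this only illustrates the getHost() and getPath() calls he mentions.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical stand-in for the two StringURL calls; not part of the patch.
final class UrlParts {

    // Host portion of the URL, via java.net.URL.getHost().
    static String hostOf(String url) {
        try {
            return new URL(url).getHost();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + url, e);
        }
    }

    // Path portion of the URL, via java.net.URL.getPath().
    // The query string is not included.
    static String pathOf(String url) {
        try {
            return new URL(url).getPath();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + url, e);
        }
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/docs/index.html?lang=en";
        System.out.println(hostOf(url)); // www.example.com
        System.out.println(pathOf(url)); // /docs/index.html
    }
}
```

Note that getPath() drops the query string, while java.net.URL.getFile() keeps it. Which of the two matches the original removeHostname() is not stated in the thread, so that choice is an assumption to verify against the patch.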
