It seems you subscribed on the wiki. You can unsubscribe there.

On Tuesday 12 July 2011 11:52:23 Marcel Schubert wrote:
> On 12.07.2011 11:39, Apache Wiki wrote:
> > Dear Wiki user,
> >
> > You have subscribed to a wiki page or wiki category on "Nutch Wiki" for
> > change notification.
> >
> > The "NutchTutorial" page has been changed by JulienNioche:
> > http://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=33&rev2=34
> >
> > Comment:
> > Removed reference to crawl-urlfilter.txt
> >
> >   * Create a directory with a flat file of root urls. For example, to
> >   crawl the nutch site you might start with a file named urls/nutch
> >   containing the url of just the Nutch home page. All other Nutch
> >   pages should be reachable from this page. The urls/nutch file would
> >   thus contain: {{{ http://lucene.apache.org/nutch/ }}}
> >
> > +  * Edit the file conf/regex-urlfilter.txt and replace
> > -  * Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME
> >   with the name of the domain you wish to crawl. For example, if you
> >   wished to limit the crawl to the apache.org domain, the line should
> >   read: - {{{ +^http://([a-z0-9]*\.)*apache.org/ }}} This will include
> >   any url in the domain apache.org.
> >
> > -  * Until someone could explain this...When I use the file
> >   crawl-urlfilter.txt the filter doesn't work, instead of it use the file
> >   conf/regex-urlfilter.txt and change the last line from "+." to "-." +
> > {{{
> > + # accept anything else
> > + +.
> > + }}}
> > +
> > + with a regular expression matching the domain you wish to crawl. For
> > example, if you wished to limit the crawl to the apache.org domain, the
> > line should read: +
> > + {{{
> > + +^http://([a-z0-9]*\.)*apache.org/
> > + }}}
> > +
> > + This will include any url in the domain apache.org.
> >
> >   === Crawl Command: Running the Crawl ===
> >
> >   Once things are configured, running the crawl is easy. Just use the
> >   crawl command. Its options include:
> >
> > @@ -162, +172 @@
> >
> >   Now we're ready to search!
> >
> > - == Command Line Searching ==
> > + == Command Line Searching (version< 1.3) ==
> >
> >   Simplest way to verify the integrity of your crawl is to launch
> >   NutchBean from command line:
> >
> >   {{{ bin/nutch org.apache.nutch.searcher.NutchBean apache }}}
> >
> >   where ''apache'' is the search term (note that NutchBean will only
> >   search pages in the {{{crawl}}} directory, so if you named the crawl
> >   directory something else, NutchBean will not find any results). After
> >   you have verified that the above command returns results you can
> >   proceed to setting up the web interface.
> >
> > - == Installing in Tomcat ==
> > + == Installing in Tomcat (version< 1.3) ==
> >
> >   To search you need to put the nutch war file into your servlet
> >   container. (If instead of downloading a Nutch release you checked the
> >   sources out of SVN, then you'll first need to build the war file,
> >   with the command {{{ant war}}}.)
> >
> >   Assuming you've unpacked Tomcat as ~/local/tomcat, then the Nutch war
> >   file may be installed with the commands:
>
> Hey,
>
> please delete my E-Mail address from your mailing list or whatever. I
> receive more than 50 mails every day.
>
> Bye

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

