Re: [Nutch Wiki] Update of "NutchTutorial" by JulienNioche

Markus Jelsma Tue, 12 Jul 2011 06:15:22 -0700

It seems you subscribed on the wiki. You can unsubscribe there.

On Tuesday 12 July 2011 11:52:23 Marcel Schubert wrote:
> On 12.07.2011 11:39, Apache Wiki wrote:
> > Dear Wiki user,
> > 
> > You have subscribed to a wiki page or wiki category on "Nutch Wiki" for
> > change notification.
> > 
> > The "NutchTutorial" page has been changed by JulienNioche:
> > http://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=33&rev2=34
> > 
> > Comment:
> > Removed reference to crawl-urlfilter.txt
> > 
> >     * Create a directory with a flat file of root urls. For example, to
> >     crawl the nutch site you might start with a file named urls/nutch
> >     containing the url of just the Nutch home page. All other Nutch
> >     pages should be reachable from this page. The urls/nutch file would
> >     thus contain: {{{ http://lucene.apache.org/nutch/ }}}
> > 
> > +  * Edit the file conf/regex-urlfilter.txt and replace
> > -  * Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME
> > with the name of the domain you wish to crawl. For example, if you
> > wished to limit the crawl to the apache.org domain, the line should
> > read: -  {{{ +^http://([a-z0-9]*\.)*apache.org/ }}} This will include
> > any url in the domain apache.org.
> > 
> > - * Until someone could explain this...When I use the file
> > crawl-urlfilter.txt the filter doesn't work, instead of it use the file
> > conf/regex-urlfilter.txt and change the last line from "+." to "-." +
> > {{{
> > + # accept anything else
> > + +.
> > + }}}
> > +
> > + with a regular expression matching the domain you wish to crawl. For
> > example, if you wished to limit the crawl to the apache.org domain, the
> > line should read: +
> > + {{{
> > +  +^http://([a-z0-9]*\.)*apache.org/
> > + }}}
> > +
> > + This will include any url in the domain apache.org.
> > 
> >    === Crawl Command: Running the Crawl ===
> > 
> >    Once things are configured, running the crawl is easy. Just use the
> >    crawl command. Its options include:
> > @@ -162, +172 @@
> > 
> >    Now we're ready to search!
> > 
> > - == Command Line Searching ==
> > + == Command Line Searching (version < 1.3) ==
> > 
> >    Simplest way to verify the integrity of your crawl is to launch
> >    NutchBean from command line:
> >    
> >    {{{ bin/nutch org.apache.nutch.searcher.NutchBean apache }}}
> >    
> >    where ''apache'' is the search term (note that NutchBean will only
> >    search pages in the {{{crawl}}} directory, so if you named the crawl
> >    directory something else, NutchBean will not find any results). After
> >    you have verified that the above command returns results you can
> >    proceed to setting up the web interface.
> > 
> > - == Installing in Tomcat ==
> > + == Installing in Tomcat (version < 1.3) ==
> > 
> >    To search you need to put the nutch war file into your servlet
> >    container. (If instead of downloading a Nutch release you checked the
> >    sources out of SVN, then you'll first need to build the war file,
> >    with the command {{{ant war}}}.)
> 
> >    Assuming you've unpacked Tomcat as ~/local/tomcat, then the Nutch war
> >    file may be installed with the commands:
> Hey,
> 
> Please remove my e-mail address from your mailing list or whatever. I
> receive more than 50 mails every day.
> 
> Bye


-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
