About two months ago John Kleven posted asking about using Nutch just to crawl.
I have essentially the same question. One possible development tack for my
project is to use Nutch for crawling, then use Xapian for tokenization,
indexing, etc. Over time we will need to spider a lot of sites, so I'm
disinclined to use wget.
Does Nutch have out-of-the-box capability to spider sites and write the output
to HTML files? If not, can someone give me a quick summary of how I would
properly modify or subclass the Nutch code?