Sorry for pushing this topic, but I would like to know if Nutch would help me get the raw HTML in my situation described below.
I am sure it would be a simple answer to those who know Nutch. If not, then I guess Nutch is the wrong tool for the job.

Thanks,
O. O.

--- On Thu 24/9/09, O. Olson <olson_...@yahoo.it> wrote:

> From: O. Olson <olson_...@yahoo.it>
> Subject: Using Nutch for only retriving HTML
> To: nutch-user@lucene.apache.org
> Date: Thursday, 24 September 2009, 20:54
>
> Hi,
> I am new to Nutch. I would like to completely crawl through an Internal Website and retrieve all the HTML Content. I don’t intend to do further processing using Nutch.
> The Website/Content is rather huge. By crawl, I mean that I would go to a page, download/archive the HTML, get the links from that page, and then download/archive those pages. I would keep doing this till I don’t have any new links.
>
> Is this possible? Is this the right tool for this job, or are there other tools out there that would be more suited for my purpose?
>
> Thanks,
> O.O.
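
To make the crawl described in the quoted message concrete (fetch a page, archive its HTML, collect its links, repeat until no new links turn up), here is a rough sketch in plain Python using only the standard library. It is not Nutch and says nothing about how Nutch works internally; it only illustrates the loop. The names `crawl`, `fetch_html`, `LinkExtractor` and the host `intranet.example` are made up for the illustration, and the sketch assumes that staying on the start URL's host is what "internal website" means here.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url):
    """Download one page and return its HTML as text (hypothetical helper)."""
    with urlopen(url, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def crawl(start_url, fetch=fetch_html, save=None):
    """Breadth-first crawl limited to the host of start_url.

    fetch(url) returns the page's HTML; save(url, html) archives it.
    Returns the URLs that were fetched successfully, in visit order.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}          # every URL ever queued, to avoid repeats
    queue = deque([start_url])  # URLs still waiting to be downloaded
    visited = []

    while queue:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that cannot be downloaded
        visited.append(url)
        if save is not None:
            save(url, html)

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)

    return visited
```

Passing a `save` callback that writes each page to disk keeps memory use flat on a large site, since only the set of seen URLs is held in memory. A real crawl would also want politeness delays, robots.txt handling and a content-type check, none of which this sketch attempts.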