Nutch 0.9: how to store fetched *.html files locally?

Jose C. Lacal Fri, 22 Feb 2008 15:15:00 -0800

Good afternoon:

Newbie question here: Nutch 0.9 works fine, but I would like to store
locally the *.html files that Nutch fetches. That is, for each URL in my
list, I want Nutch to save the fetched *.html file in a directory of my
choosing.


I read an earlier reply to the mailing list along these lines:

From: "Martin Kuen" <[EMAIL PROTECTED]>
..
>
> Hi,
>
> Thank you :)
> One more question for the fetched page reading: I prefer I can dump the
> fetched page into a single html file.
You could modify the Fetcher class (org.apache.nutch.fetch.Fetcher) to
create a separate file for each downloaded file.
You could modify the SegmentReader class
(org.apache.nutch.segment.SegmentReader) if you want to do that.
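
For illustration only, the "separate file for each downloaded file" step
from that reply might look roughly like the sketch below. It uses nothing
but the Java standard library: PageSaver, fileNameFor, and save are
hypothetical names, not part of Nutch, and where such a call would be
hooked into Fetcher or SegmentReader is an assumption that is not shown.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not a Nutch class: writes one page per file.
public class PageSaver {

    // Turn a URL into a flat file name that is safe on common file systems.
    // Note: distinct URLs can map to the same name (e.g. "a/b" and "a_b"),
    // and very long URLs may exceed file-name length limits.
    static String fileNameFor(String url) {
        // Drop the scheme, then replace anything outside [A-Za-z0-9._-] with '_'.
        String name = url.replaceFirst("^[a-zA-Z][a-zA-Z0-9+.-]*://", "")
                         .replaceAll("[^A-Za-z0-9._-]", "_");
        if (name.isEmpty()) {
            name = "index";
        }
        return name.endsWith(".html") ? name : name + ".html";
    }

    // Write the raw page bytes into outputDir, creating the directory if needed.
    // An existing file with the same name is overwritten.
    static Path save(Path outputDir, String url, byte[] content) throws IOException {
        Files.createDirectories(outputDir);
        Path target = outputDir.resolve(fileNameFor(url));
        return Files.write(target, content);
    }

    public static void main(String[] args) throws IOException {
        // The output directory is taken from the command line (an assumption).
        Path dir = Paths.get(args.length > 0 ? args[0] : "fetched-pages");
        byte[] html = "<html><body>example</body></html>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(save(dir, "http://www.example.com/docs/page.html", html));
    }
}
```

The flat naming scheme keeps every page in one directory, which is the
simplest thing that matches what I am after; mirroring the URL path as
nested directories would be an alternative.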


Since I am not a Java expert, I was wondering whether somebody else has
tackled this issue before.

Also, would it be feasible to add this as a feature request for future
releases? Nutch's fetch capability is very useful by itself, and it might
not be that difficult to expose this feature via the nutch-site.xml
file.


Regards.



-- 

Jose C. Lacal, Founder &  Chief Vision Officer

Open Personalized Health Informatics "OpenPHI"
15625 NW  15th Avenue; Suite 15
Miami, FL 33169-5601  USA       www.OpenPHI.com
[O] +1 (305) 395-6091     [M] +1 (954) 553-1984
[EMAIL PROTECTED]    [F] +1 (954) 364-7144
