Hi,
I'm a newbie to Nutch.
I installed Nutch and used it to run a crawl successfully.
However, when I checked the crawled files under /segments/***/fetcher/,
they were not in .html or any similar format.
(Each subfolder contains only two files, named "data" and "index".)
I want to crawl thousands of web pages and parse the
HTML code of each page. What should I
do so that the crawled pages are stored in HTML format?
Thanks.
--
sarah
_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general