[Nutch-general] crawled page are not in HTML -- what should I do?

Sarah Zhai Thu, 18 Aug 2005 09:55:01 -0700

Hi,
I'm a newbie to Nutch.
I installed Nutch and used it to run a crawl successfully.

The problem is that when I checked the crawled files under /segments/***/fetcher/, they are not in .html or any similar format. (There are two files named "data" and "index" under each subfolder.)


Since I want to crawl thousands of web pages and parse the HTML code of each one, I was wondering: what should I do so that the crawled pages are in HTML format?


Thanks.

--
sarah


