Hi,
I'm a newbie to Nutch.
I installed Nutch and used it to run a crawl successfully.

However, when I checked the crawled files under /segments/***/fetcher/, they were not in .html or any similar format. (Each subfolder contains just two files, named "data" and "index".)

Since I want to crawl thousands of web pages and parse the HTML code of each one, I was wondering: what should I do so that the crawled pages are in HTML format?

Thanks.

--
sarah
