I am a newbie too...

Look at ParseHtml.java under src/plugin/parse-html. There you can work
directly with the Content object (the binary HTTP response), split it
into HTTP headers, meta tags, and body, and parse it...

public Parse getParse(Content content){...}

This method is called from org.apache.nutch.Fetcher

It seems that Nutch stores only the parsed data, in "gzip" format. In my
case I don't need to store the plain HTML, only a subset of it.
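
To make the "split it into headers, meta tags, and body" idea concrete,
here is a rough standalone sketch. It deliberately does not use any Nutch
classes: ResponseSplitter, Parts, and split() are names I made up for
illustration, and it assumes you have the raw response bytes with the
HTTP header block still attached. The regexes are simplistic (for
example, they only see meta tags written with name before content), so
treat it as a starting point rather than a real HTML parser.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch.
public class ResponseSplitter {

    /** Holds the three pieces: HTTP headers, meta tags, and body markup. */
    public static final class Parts {
        public final Map<String, String> headers = new LinkedHashMap<>();
        public final Map<String, String> metaTags = new LinkedHashMap<>();
        public String body = "";
    }

    // Simplified: matches <meta name="..." content="..."> in that attribute order only.
    private static final Pattern META = Pattern.compile(
            "<meta\\s+name=[\"']([^\"']+)[\"']\\s+content=[\"']([^\"']*)[\"'][^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Captures everything between <body ...> and </body>.
    private static final Pattern BODY = Pattern.compile(
            "<body[^>]*>(.*?)</body>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    public static Parts split(byte[] rawResponse) {
        // ISO-8859-1 maps every byte to one char, so no bytes are lost here.
        String text = new String(rawResponse, StandardCharsets.ISO_8859_1);
        Parts parts = new Parts();

        // A blank line (CRLF CRLF) separates the header block from the payload.
        int sep = text.indexOf("\r\n\r\n");
        String head = sep >= 0 ? text.substring(0, sep) : "";
        String html = sep >= 0 ? text.substring(sep + 4) : text;

        // Skip line 0 (the status line) and read "Name: value" pairs.
        String[] lines = head.split("\r\n");
        for (int i = 1; i < lines.length; i++) {
            int colon = lines[i].indexOf(':');
            if (colon > 0) {
                String name = lines[i].substring(0, colon).trim().toLowerCase(Locale.ROOT);
                parts.headers.put(name, lines[i].substring(colon + 1).trim());
            }
        }

        // Collect meta tags, keyed by lower-cased name.
        Matcher meta = META.matcher(html);
        while (meta.find()) {
            parts.metaTags.put(meta.group(1).toLowerCase(Locale.ROOT), meta.group(2));
        }

        // Keep only the body markup; fall back to the whole payload if there is no <body>.
        Matcher body = BODY.matcher(html);
        parts.body = body.find() ? body.group(1) : html;
        return parts;
    }

    public static void main(String[] args) {
        String raw = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"
                + "<html><head><meta name=\"keywords\" content=\"nutch,crawl\"></head>"
                + "<body><p>Hi</p></body></html>";
        Parts p = split(raw.getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(p.headers);
        System.out.println(p.metaTags);
        System.out.println(p.body);
    }
}
```

Inside a parse plugin you would take the bytes from the Content object
instead of a hand-built string, and keep only the parts you care about.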


-----Original Message-----
From: Sarah Zhai [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, August 17, 2005 8:51 PM
To: [email protected]
Cc: [EMAIL PROTECTED]
Subject: crawled page are not in HTML -- what should I do?


Hi,
I'm a newbie to Nutch.
I installed Nutch and used it to crawl successfully.

The point is, I checked the crawled files under /segments/***/fetcher/ 
and they are not in .html or other similar format. 
(There are two files named "data" and "index" under each subfolder.)

Since I want to crawl thousands of web pages and parse the
HTML code of each web page...I was wondering, what should I 
do so that the crawled pages can be in HTML format?

Thanks.

--
sarah
