I am a newbie too...
Look at src/plugin/parse-html, ParseHtml.java. There you can work
directly with the Content object (the binary HTTP response), split it
into HTTP headers, meta tags, and body, and parse it:
public Parse getParse(Content content){...}
This method is called from org.apache.nutch.Fetcher.
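Roughly what I mean by splitting, as a standalone sketch in plain Java. It does not use the Nutch API: the class RawResponseSplitter and its methods are hypothetical names, and it assumes you have the raw response (status line, headers, blank line, body) as one string. The real Content class may already expose some of these pieces separately, so check its source first. The regex only handles meta tags written as name=... content=... in that order.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch.
public class RawResponseSplitter {
    // Matches <meta name="..." content="..."> with the attributes in that order only.
    private static final Pattern META = Pattern.compile(
        "<meta\\s+name=[\"']([^\"']+)[\"']\\s+content=[\"']([^\"']*)[\"'][^>]*>",
        Pattern.CASE_INSENSITIVE);

    /** Header fields before the first blank line, keyed by lower-cased name. */
    public static Map<String, String> headers(String raw) {
        Map<String, String> result = new LinkedHashMap<>();
        int end = raw.indexOf("\r\n\r\n");
        String head = end < 0 ? raw : raw.substring(0, end);
        String[] lines = head.split("\r\n");
        for (int i = 1; i < lines.length; i++) { // line 0 is the status line
            int colon = lines[i].indexOf(':');
            if (colon > 0) {
                result.put(lines[i].substring(0, colon).trim().toLowerCase(Locale.ROOT),
                           lines[i].substring(colon + 1).trim());
            }
        }
        return result;
    }

    /** Everything after the first blank line, or "" if there is none. */
    public static String body(String raw) {
        int end = raw.indexOf("\r\n\r\n");
        return end < 0 ? "" : raw.substring(end + 4);
    }

    /** Meta tags found in the HTML, keyed by lower-cased name. */
    public static Map<String, String> metaTags(String html) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = META.matcher(html);
        while (m.find()) {
            result.put(m.group(1).toLowerCase(Locale.ROOT), m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up response bytes, decoded byte-for-byte with ISO-8859-1.
        byte[] bytes = ("HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"
            + "<html><head><meta name=\"keywords\" content=\"nutch,crawl\"></head>"
            + "<body>Hello</body></html>").getBytes(StandardCharsets.ISO_8859_1);
        String raw = new String(bytes, StandardCharsets.ISO_8859_1);
        System.out.println(headers(raw));
        System.out.println(metaTags(body(raw)));
    }
}
```

For real pages a proper HTML parser is more robust than a regex; this is only meant to show the three pieces.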
It seems that Nutch stores only the parsed data, in "gzip" format, and in
my case I don't need to store plain HTML, only a subset of the HTML.
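To illustrate keeping only a subset of the HTML in compressed form, here is a small standalone sketch. It is not how Nutch itself writes its segment files; the class HtmlSubsetStore and its methods are hypothetical, and "subset" is assumed here to mean the contents of the body element. It uses only java.util.zip from the standard library.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of Nutch.
public class HtmlSubsetStore {
    // Captures everything between the first <body ...> tag and the last </body>.
    private static final Pattern BODY = Pattern.compile(
        "<body\\b[^>]*>(.*)</body>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the body contents, or the input unchanged if no body element is found. */
    public static String bodyOnly(String html) {
        Matcher m = BODY.matcher(html);
        return m.find() ? m.group(1) : html;
    }

    /** Compresses the text as UTF-8 bytes in gzip format. */
    public static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Reverses gzip(): decompresses and decodes as UTF-8. */
    public static String gunzip(byte[] data) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String html = "<html><head><title>t</title></head><body><p>Hello</p></body></html>";
        String subset = bodyOnly(html);
        byte[] stored = gzip(subset);
        System.out.println(stored.length + " bytes stored, restored: " + gunzip(stored));
    }
}
```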
-----Original Message-----
From: Sarah Zhai [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 17, 2005 8:51 PM
To: [email protected]
Cc: [EMAIL PROTECTED]
Subject: crawled page are not in HTML -- what should I do?
Hi,
I'm a newbie to Nutch.
I installed Nutch and used it to do the crawling successfully.
The point is, I checked the crawled files under /segments/***/fetcher/
and they are not in .html or other similar format.
(There are two files named "data" and "index" under each subfolder.)
Since I want to crawl thousands of web pages and parse the
HTML code of each web page...I was wondering, what should I
do so that the crawled pages can be in HTML format?
Thanks.
--
sarah