Nutch has its own parser, and the "data" and "index"
files are the result of its parsing and indexing.
Nutch is a search engine, which means it provides a search
interface (served by Tomcat) so that users can run text
searches over the sites it has crawled (like Google, for example).
Nutch is not a crawl tool that fetches raw HTML from a
website and saves each page to local disk the way an
ordinary crawler does.
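
For comparison, here is a rough sketch of what an "ordinary crawler" does for a single page: fetch the raw bytes and write them to a .html file. This is not Nutch code. The function names and the file naming scheme are hypothetical, made up for illustration, and it uses only the Python standard library.

```python
import hashlib
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def url_to_filename(url: str) -> str:
    """Map a URL to a flat, filesystem-safe .html name (hypothetical scheme)."""
    host = urlparse(url).netloc.replace(":", "_") or "unknown-host"
    # A short hash of the full URL keeps pages from the same host distinct.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:12]
    return f"{host}-{digest}.html"


def save_page(url: str, body: bytes, out_dir: str) -> Path:
    """Write the raw, unparsed bytes of one page into out_dir."""
    target_dir = Path(out_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / url_to_filename(url)
    target.write_bytes(body)
    return target


def fetch_and_save(url: str, out_dir: str, timeout: float = 10.0) -> Path:
    """Fetch one URL over HTTP and store the response body as-is."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return save_page(url, response.read(), out_dir)
```

Calling fetch_and_save("http://www.example.com/", "pages") would leave one .html file under pages/. A real crawl of thousands of pages would also need link extraction, politeness delays and robots.txt handling, which this sketch leaves out.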
Michael Ji
--- Sarah Zhai <[EMAIL PROTECTED]> wrote:
> Hi,
> I'm a newbie to Nutch.
> I installed Nutch and used it to do the crawling
> successfully.
>
> The point is, I checked the crawled files under
> /segments/***/fetcher/
> and they are not in .html or other similar format.
> (There are two files named "data" and "index" under
> each subfolder.)
>
> Since I want to crawl thousands of web pages and
> parse the
> HTML code of each web page...I was wondering, what
> should I
> do so that the crawled pages can be in HTML format?
>
> Thanks.
>
> --
> sarah
>
>