Re: crawled page are not in HTML -- what should I do?

Michael Ji Wed, 17 Aug 2005 18:07:54 -0700

Nutch has its own parser, and the "data" and "index" files
are the result of parsing and indexing.


Nutch is a search engine, which means it provides a search
interface (through Tomcat) so that users can run text searches
over the crawled sites (like Google).

Nutch is not a crawl tool that fetches raw HTML from
websites and saves the content to local disk, as an
ordinary crawler does.
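
If plain .html files on disk are what you need, here is a minimal
sketch of what such an "ordinary crawler" step could look like. It is
plain Python using only the standard library and is not part of Nutch;
the function names (filename_for, save_page, fetch_and_save) and the
"pages" output directory are hypothetical, chosen for illustration.

```python
import hashlib
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def filename_for(url: str) -> str:
    """Map a URL to a flat, filesystem-safe .html file name."""
    host = urlparse(url).netloc.replace(":", "_") or "unknown-host"
    # A short hash of the full URL keeps pages from the same host distinct.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:12]
    return f"{host}-{digest}.html"


def save_page(url: str, body: bytes, out_dir: Path) -> Path:
    """Write the raw bytes of one page into out_dir and return the path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / filename_for(url)
    path.write_bytes(body)
    return path


def fetch_and_save(urls, out_dir="pages", timeout=10.0):
    """Fetch each URL and store its raw content; skip URLs that fail."""
    saved = []
    for url in urls:
        try:
            with urlopen(url, timeout=timeout) as response:
                body = response.read()
        except OSError as exc:  # URLError and socket timeouts are OSError subclasses
            print(f"skipped {url}: {exc}")
            continue
        saved.append(save_page(url, body, Path(out_dir)))
    return saved
```

Calling fetch_and_save(["http://example.com/"]) would write one .html
file under ./pages. Note that this sketch does not follow links, honor
robots.txt, or rate-limit requests, so it only covers a fixed URL list.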

Michael Ji

--- Sarah Zhai <[EMAIL PROTECTED]> wrote:

> Hi,
> I'm a newbie to Nutch.
> I installed Nutch and used it to do the crawling
> successfully.
> 
> The point is, I checked the crawled files under
> /segments/***/fetcher/ 
> and they are not in .html or other similar format. 
> (There are two files named "data" and "index" under
> each subfolder.)
> 
> Since I want to crawl thousands of web pages and parse
> the HTML code of each web page, I was wondering what I
> should do so that the crawled pages can be in HTML format?
> 
> Thanks.
> 
> --
> sarah
> 
> 



                
