Hi,

If I need to exclude some parts of a web page from being indexed, how can I do that? As I understand it, the DOMContentUtils class of the HTML parser plugin currently ignores only SCRIPT, STYLE, and comment text. Can I configure it to exclude some other tags too?
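To make the question concrete, something along the lines of the sketch below is what I have in mind -- the class name and the extra "noindex" tag are placeholders I made up, not the plugin's actual code:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Rough sketch only, not the plugin's actual API: a DOM text walker that
// skips a configurable set of element names, similar in spirit to what
// DOMContentUtils does for SCRIPT, STYLE, and comments.
public class TagSkippingTextExtractor {

  // "noindex" is just a placeholder for whatever extra tag I'd exclude.
  private final Set<String> skippedTags =
      new HashSet<String>(Arrays.asList("script", "style", "noindex"));

  public void getText(StringBuffer sb, Node node) {
    short type = node.getNodeType();

    if (type == Node.COMMENT_NODE) {
      return;                                   // drop comment text
    }
    if (type == Node.ELEMENT_NODE
        && skippedTags.contains(node.getNodeName().toLowerCase())) {
      return;                                   // drop the whole excluded element
    }
    if (type == Node.TEXT_NODE) {
      sb.append(node.getNodeValue()).append(' ');
      return;
    }

    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      getText(sb, children.item(i));            // recurse into everything else
    }
  }
}

If there is already a configuration property or an extension point that achieves this, that would obviously be preferable to patching the parser.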
Thanks,
Kannan

On Thu, 2005-05-26 at 15:34 -0400, Andy Liu wrote:
> If you download the most recent version of Nutch from SVN, the newer
> CrawlTool doesn't fetch pages twice.
>
> As far as limiting the number of pages to crawl, you can use the -topN
> flag when generating your segments.
>
> Andy
>
> On 5/26/05, Ian Reardon <[EMAIL PROTECTED]> wrote:
> > I have been crawling rather large sites (larger than 10k pages) with
> > the crawl command. It seems like it crawls all the pages twice. Is
> > that normal? I thought it was just removing the segments, but it looks
> > like it crawls all the pages, does some update to the DB, and then
> > crawls them again. If anyone could shed some light on this, I would
> > appreciate it.
> >
> > Second question: is there a way to limit a crawl to a number of pages
> > rather than depth? I would like to limit a crawl to, say, 100 pages,
> > 1000 pages, or whatever. I could brute-force it by writing a script to
> > watch the logs and then kill the crawler, but I'd rather not take
> > that approach.
> >
> > Thanks.
> >
> > Ian
> >
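P.S. For anyone finding this thread later: if I'm reading the whole-web tutorial right, the -topN usage Andy describes looks roughly like the line below. The db and segments paths are just placeholders for whatever layout you use, and the exact syntax may differ between Nutch versions.

  # generate a fetchlist limited to the 1000 top-scoring pages
  bin/nutch generate db segments -topN 1000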
