Hi,

If I need to exclude some parts of a web page from being indexed, how can I do that? As I understand it, the DOMContentUtils class of the HTML parser plugin currently ignores only SCRIPT, STYLE, and comment text. Can I configure it to exclude some other tags too?
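To make the question concrete, something along the lines of the sketch below is what I have in mind -- the class name and the extra "noindex" tag are placeholders I made up, not the plugin's actual code:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Rough sketch only, not the plugin's actual API: a DOM text walker that
// skips a configurable set of element names, similar in spirit to what
// DOMContentUtils does for SCRIPT, STYLE, and comments.
public class TagSkippingTextExtractor {

  // "noindex" is just a placeholder for whatever extra tag I'd exclude.
  private final Set<String> skippedTags =
      new HashSet<String>(Arrays.asList("script", "style", "noindex"));

  public void getText(StringBuffer sb, Node node) {
    short type = node.getNodeType();

    if (type == Node.COMMENT_NODE) {
      return;                                   // drop comment text
    }
    if (type == Node.ELEMENT_NODE
        && skippedTags.contains(node.getNodeName().toLowerCase())) {
      return;                                   // drop the whole excluded element
    }
    if (type == Node.TEXT_NODE) {
      sb.append(node.getNodeValue()).append(' ');
      return;
    }

    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      getText(sb, children.item(i));            // recurse into everything else
    }
  }
}

If there is already a configuration property or an extension point that achieves this, that would obviously be preferable to patching the parser.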
Thanks,
Kannan

On Thu, 2005-05-26 at 15:34 -0400, Andy Liu wrote:
> If you download the most recent version of Nutch from SVN, the newer
> CrawlTool doesn't fetch pages twice.
>
> As far as limiting the number of pages to crawl, you can use the -topN
> flag when generating your segments.
>
> Andy
>
> On 5/26/05, Ian Reardon <[EMAIL PROTECTED]> wrote:
> > I have been crawling rather large sites (larger than 10k pages) with
> > the crawl command. It seems like it crawls all the pages twice. Is
> > that normal? I thought it was just removing the segments, but it looks
> > like it crawls all the pages, does some update to the DB, and then
> > crawls them again. If anyone could shed some light on this, I would
> > appreciate it.
> >
> > Second question: is there a way to limit a crawl to a number of pages
> > rather than depth? I would like to limit a crawl to, say, 100 pages,
> > 1000 pages, or whatever. I could brute-force it by writing a script to
> > watch the logs and then kill the crawler, but I'd rather not take
> > that approach.
> >
> > Thanks.
> >
> > Ian
> >
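P.S. For anyone finding this thread later: if I'm reading the whole-web tutorial right, the -topN usage Andy describes looks roughly like the line below. The db and segments paths are just placeholders for whatever layout you use, and the exact syntax may differ between Nutch versions.

  # generate a fetchlist limited to the 1000 top-scoring pages
  bin/nutch generate db segments -topN 1000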
