[Nutch-dev] Re: Why Crawl failed to fetch so many pages?

Nutch dev mail Sun, 03 Jul 2005 20:21:11 -0700

Please change the rule below:

(# skip URLs containing certain characters as probable queries, etc.
# [EMAIL PROTECTED])

The link
http://news.buaa.edu.cn/dispnews.php?type=1&nid=2500&s_table=news_txt
contains the characters ?, = and &, so this rule causes it to be ignored.
After the change it should read:

(# skip URLs containing certain characters as probable queries, etc.
# [EMAIL PROTECTED])
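
To illustrate why the rule above drops this link, here is a minimal Python sketch of a first-match URL filter. It is only an approximation of how a Nutch regex URL filter file is read (first matching pattern decides, no match means the URL is ignored). The real rule is redacted in this archive, so the character class `[?=&]` is an assumption built from the characters named in this message, and the `accepts` helper and rule lists are hypothetical names.

```python
import re

# Hypothetical rule lists: each entry is (sign, pattern).
# "-" means skip the URL, "+" means accept it.
# The class [?=&] is an assumption; the original rule is redacted.
RULES_WITH_SKIP = [
    ("-", r"[?=&]"),
    ("+", r"^http://news\.buaa\.edu\.cn/"),
]
RULES_WITHOUT_SKIP = [
    ("+", r"^http://news\.buaa\.edu\.cn/"),
]


def accepts(url, rules):
    """Return True if the first pattern found in url belongs to a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched, so the URL is dropped


URL = "http://news.buaa.edu.cn/dispnews.php?type=1&nid=2500&s_table=news_txt"

print(accepts(URL, RULES_WITH_SKIP))     # skip rule matches first
print(accepts(URL, RULES_WITHOUT_SKIP))  # only the host rule applies
```

With the skip rule active the link is rejected before the host rule is ever checked; with the skip rule commented out, the host rule accepts it.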




2005/4/14, Andy Liu <[EMAIL PROTECTED]>:
> 
> By default, Nutch only crawls the first 100 outlinks on a page. Maybe
> that's your problem?
> 
> On 4/14/05, Matthias Jaekle <[EMAIL PROTECTED]> wrote:
> > > try
> > > +^http://news.buaa.edu.cn/*
> > This should not be the reason.
> > Your regex matches URLs starting with:
> > http://news.buaa.edu.cn
> > http://news.buaa.edu.cn/
> > http://news.buaa.edu.cn//
> > http://news.buaa.edu.cn/// ...
> >
> > The only thing I would try is to escape some characters to make it more
> > correct. A dot matches any character. Better:
> > +^http:\/\/news\.buaa\.edu\.cn
> >
> > Did you run enough rounds to reach the depth you want?
> > Each crawl round only fetches links that are already known.
> >
> > Matthias
> >
> > --
> > http://www.eventax.com - eventax GmbH
> > http://www.umkreisfinder.de - Die Suchmaschine für Lokales und Events
> >
> 
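
Matthias's point about the unescaped dot and the trailing `/*`, quoted above, can be checked with a short Python sketch. The two patterns are copied from the thread; the look-alike host `newsxbuaa.edu.cn`, the `/index.php` path, and the `matches` helper are hypothetical and exist only for this illustration. Python's `re` is used as a stand-in here, on the assumption that these simple patterns behave the same way in the Java regex engine.

```python
import re

# Pattern as first suggested: dots are unescaped, "/*" means zero or more slashes.
LOOSE = re.compile(r"^http://news.buaa.edu.cn/*")
# Pattern with escaped dots and slashes, as Matthias proposed.
STRICT = re.compile(r"^http:\/\/news\.buaa\.edu\.cn")

REAL = "http://news.buaa.edu.cn/index.php"  # hypothetical path on the real host
LOOKALIKE = "http://newsxbuaa.edu.cn/"      # hypothetical host, "x" instead of "."


def matches(pattern, url):
    """Return True if the pattern is found at the start of url."""
    return pattern.search(url) is not None


print(matches(LOOSE, REAL), matches(STRICT, REAL))
print(matches(LOOSE, LOOKALIKE), matches(STRICT, LOOKALIKE))
```

Both patterns accept the real host, so escaping does not explain missing pages; it only stops the pattern from also accepting hosts where some other character stands in place of a dot.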



-- 
TEL 0512-68251233-6966
MSN:[EMAIL PROTECTED]
Mail:[EMAIL PROTECTED]
QQ:58624951
BenQ.com <http://BenQ.com>
268 Shishan Road, New District, 
Suzhou, China
