Please modify the rule below:

# skip URLs containing certain characters as probable queries, etc.
# [EMAIL PROTECTED]

The link http://news.buaa.edu.cn/dispnews.php?type=1&nid=2500&s_table=news_txt
contains the characters ?, = and &, so under this rule it is ignored.
After the change it should be:

# skip URLs containing certain characters as probable queries, etc.
# [EMAIL PROTECTED]

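
To illustrate why that link is dropped, here is a minimal Python sketch of a
first-match-wins URL filter. It is not Nutch's actual code. The skip pattern
[?=&] is a hypothetical stand-in for the rule shown above (the original was
redacted by the archive), and the sketch assumes the fix is to disable that
rule. The names RULES_DEFAULT, RULES_SKIP_DISABLED and accepts are invented
for this example.

```python
import re

# Rules are checked in order; the first pattern found in the URL decides.
# "-" rejects the URL, "+" accepts it. Both rules are assumptions for this sketch.
RULES_DEFAULT = [
    ("-", re.compile(r"[?=&]")),                        # stand-in query-character skip rule
    ("+", re.compile(r"^http://news\.buaa\.edu\.cn")),  # accept the news site
]

# The same rule list with the skip rule disabled.
RULES_SKIP_DISABLED = RULES_DEFAULT[1:]


def accepts(url, rules):
    """Return True if the first matching rule is a '+' rule, else False."""
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    return False  # a URL that matches no rule is ignored


url = "http://news.buaa.edu.cn/dispnews.php?type=1&nid=2500&s_table=news_txt"
print(accepts(url, RULES_DEFAULT))        # False: the skip rule matches first
print(accepts(url, RULES_SKIP_DISABLED))  # True: only the accept rule applies
```

Because the skip rule comes before the accept rule, the accept pattern is
never consulted for URLs that carry a query string.
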
2005/4/14, Andy Liu <[EMAIL PROTECTED]>:
>
> By default, Nutch only crawls the first 100 outlinks on a page. Maybe
> that's your problem?
>
> On 4/14/05, Matthias Jaekle <[EMAIL PROTECTED]> wrote:
> > > try
> > > +^http://news.buaa.edu.cn/*
> > This should not be the reason.
> > Your regex matches URLs starting with:
> > http://news.buaa.edu.cn
> > http://news.buaa.edu.cn/
> > http://news.buaa.edu.cn//
> > http://news.buaa.edu.cn/// ...
> >
> > The only thing I would try is to escape some characters to make it more
> > correct. An unescaped dot matches any character. Better:
> > +^http:\/\/news\.buaa\.edu\.cn
> >
> > Did you run enough rounds to reach the depth you want?
> > Each crawl round only fetches the links that are already known.
> >
> > Matthias
> >
> > --
> > http://www.eventax.com - eventax GmbH
> > http://www.umkreisfinder.de - The search engine for local information and events
> >
>
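
The regex point in Matthias's reply can be checked with a short Python
sketch. It compares the original pattern with an escaped one. Python's re
module is used here only for illustration and may differ in detail from the
regex engine Nutch uses. The host newsXbuaa.edu.cn is a made-up example, and
the forward slashes are left unescaped because they need no escaping in
Python.

```python
import re

# Original pattern: "/*" means zero or more slashes, "." matches any character.
loose = re.compile(r"^http://news.buaa.edu.cn/*")
# Escaped pattern: each dot must be a literal dot.
strict = re.compile(r"^http://news\.buaa\.edu\.cn")

candidates = [
    "http://news.buaa.edu.cn",
    "http://news.buaa.edu.cn/",
    "http://news.buaa.edu.cn///",
    "http://news.buaa.edu.cn/dispnews.php",
    "http://newsXbuaa.edu.cn/",  # hypothetical host that differs where a dot should be
]

for url in candidates:
    # Print whether each pattern is found at the start of the URL.
    print(url, bool(loose.search(url)), bool(strict.search(url)))
```

Both patterns accept every URL under the intended host, so the trailing /*
changes nothing. Only the escaped version rejects the look-alike host.
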
