Try +^http://news.buaa.edu.cn/*
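
To check which URLs a rule set like this would let through, here is a minimal Python sketch. It is not Nutch's own code. It assumes that rules are tried in file order, that the first rule whose pattern is found anywhere in the URL decides the outcome, and that a URL matching no rule is rejected. The names `RULES` and `accepts` are made up for this illustration, and the rules are copied from the crawl-urlfilter.txt quoted below, with the commented-out lines left out.

```python
import re

# Rules taken from the quoted crawl-urlfilter.txt, in file order.
# Lines that are commented out in that file are omitted here.
RULES = [
    ("-", r"^(file|ftp|mailto|https):"),
    ("-", r"\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg"
          r"|xls|gz|rpm|tgz|mov|MOV|exe)$"),
    ("+", r"^http://news.buaa.edu.cn/*"),
]


def accepts(url, rules=RULES):
    """Return True if the first rule that matches `url` is a '+' rule."""
    for sign, pattern in rules:
        # Assumption: a pattern matches if it is found anywhere in the URL.
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule is rejected.
    return False


if __name__ == "__main__":
    for url in [
        "http://news.buaa.edu.cn/sortnews.php?type=1",
        "http://news.buaa.edu.cn/dispnews.php?type=1&nid=2500&s_table=news_txt",
        "https://news.buaa.edu.cn/",
    ]:
        print(accepts(url), url)
```

If a URL that was not fetched comes back as accepted here, the cause is probably somewhere other than the filter rules, but that only holds as far as the assumptions above match what your Nutch version does.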
On 4/13/05, cao yuzhong <[EMAIL PROTECTED]> wrote:
>
> I want to crawl all the pages in http://news.buaa.edu.cn
>
> The following is my crawl-urlfilter.txt:
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto|https):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
>
> # skip URLs containing certain characters as probable queries, etc.
> # [EMAIL PROTECTED]
>
> # accept anything else
> #+.
> +^http://news.buaa.edu.cn
>
> I started the crawl at the URL http://news.buaa.edu.cn with a crawl depth
> of 50.
> But I found that many pages in http://news.buaa.edu.cn have not been fetched.
>
> For example:
> The URL http://news.buaa.edu.cn/sortnews.php?type=1 has been fetched.
> It links to many pages.
> Some of those pages have also been fetched, such as
> http://news.buaa.edu.cn/dispnews.php?type=1&nid=2508&s_table=news_txt
>
> but many failed to be fetched, such as:
> http://news.buaa.edu.cn/dispnews.php?type=1&nid=2500&s_table=news_txt
> http://news.buaa.edu.cn/dispnews.php?type=1&nid=2504&s_table=news_txt
> http://news.buaa.edu.cn/dispnews.php?type=1&nid=2505&s_table=news_txt
>
> What's wrong with the crawl?
>
> Best regards.
> Cao Yuzhong
> 2005-04-13

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
