Hi Kevin,

   Are you using the crawl tool for crawling? If not, then you should look at
the regex patterns in regex-urlfilter.txt.
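   For reference, a typical regex-urlfilter.txt might look like the sketch
below (example.com is a placeholder for your own domain; the pattern form
follows the one in your crawl-urlfilter.txt):

# accept anything under example.com (hypothetical domain)
+^http://([a-z0-9]*\.)*example\.com/
# reject everything else
-.

   Note that the rules are applied top-down and the first match wins, so an
overly broad reject rule above your accept rule will silently drop links.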

   Also, look at the db.max.outlinks.per.page property in nutch-site.xml.
This property limits how many outlinks from a single page are processed;
any links beyond that limit are dropped.
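   A sketch of the override in nutch-site.xml (the value shown is an
assumption for illustration; -1 removes the limit entirely):

<property>
  <name>db.max.outlinks.per.page</name>
  <!-- maximum outlinks processed per page; -1 means unlimited -->
  <value>-1</value>
</property>

   If your article-list page has more links than the configured limit, the
links past the cutoff will never enter the crawl db, which would match the
behavior you describe.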

Regards,

-vishal.

-----Original Message-----
From: kevin.Y [mailto:[EMAIL PROTECTED] 
Sent: Friday, August 24, 2007 4:22 PM
To: [email protected]
Subject: why did nutch miss so many links when crawling?


hi! 
I ran into a problem using nutch-0.9 to crawl a site.
There was an article-list page, and on this page there were many links
pointing to the article pages.
So I made nutch start crawling from the list page so that those articles
could be indexed.
However, during the crawl I found nutch ignored all those article links!
In the end, none of those articles were indexed, only some other pages. I
tried several times and got the same problem.
I'm sure there's no problem with conf/crawl-urlfilter.txt (it contains
+^http://([a-z0-9]*\.)*site.com/ ).
Doesn't nutch pull out all the links from a page and crawl them all? Have I
made some stupid mistake?
Any help??

Any reply will be greatly appreciated!
-- 
View this message in context:
http://www.nabble.com/why-did-nutch-miss-so-many-links-when-crawling--tf4322916.html#a12310200
Sent from the Nutch - User mailing list archive at Nabble.com.

