Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by ra:
http://wiki.apache.org/nutch/FAQ

------------------------------------------------------------------------------
  The crawl tool expects as its first parameter the folder name where the seeding urls file is located so for example if your urls.txt is located in /nutch/seeds the crawl command would look like: crawl seed -dir /user/nutchuser...

- ==== Some pages are not indexed but my regex file and everyhing else is okay - what is going on? ====
+ ==== Some pages are not indexed but my regex file and everything else is okay - what is going on? ====
  The crawl tool has a default limitation of 100 outlinks of one page that are being fetched.
- To overcome this limitation change the property to a higher value or simply -1.
+ To overcome this limitation change the property to a higher value or simply -1 (unlimited).

  file: conf/nutch-default.xml
+ {{{
  <property>
    <name>db.max.outlinks.per.page</name>
@@ -415, +416 @@
  </property>
  }}}

  see also: http://www.mail-archive.com/[EMAIL PROTECTED]/msg08665.html
-
+ (tested under nutch 0.9)
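
For illustration, a minimal sketch of the property with the outlink limit removed. The diff elides the lines between `<name>` and `</property>`, so the `<value>` element and the comment below are assumptions rather than the page's actual text. Placing the override in conf/nutch-site.xml instead of editing conf/nutch-default.xml is also an assumption about local setup; Nutch loads nutch-site.xml after the defaults, so a value set there should take precedence.

```xml
<!-- Assumed placement: conf/nutch-site.xml, inside the <configuration> root element. -->
<!-- Only the property name and the -1 setting come from the FAQ entry; the rest is illustrative. -->
<property>
  <name>db.max.outlinks.per.page</name>
  <!-- A negative value lifts the default cap of 100 outlinks processed per page. -->
  <value>-1</value>
</property>
```

Pages already fetched under the old limit would likely need to be re-crawled before their remaining outlinks are picked up.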