Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by KaiMiddleton:
http://wiki.apache.org/nutch/FAQ

------------------------------------------------------------------------------
  Assuming your index is located at /index :
  {{{% cd /index/
  % $CATATALINA_HOME/bin/startup.sh}}}
- '''Now you can search.''
+ '''Now you can search.'''

  2) After building your first index, start and stop Tomcat which will make Tomcat extrat the Nutch webapp. Than you need to edit the nutch-site.xml and put in it the location of the index folder.
  {{{% $CATATALINA_HOME/bin/startup.sh
@@ -391, +391 @@
  </property>
  }}}

  After that, __don't forget to crawl again__ and you should be able to retrieve the mime-type and content-length through the class HitDetails (via the fields "primaryType", "subType" and "contentLength") as you normally do for the title and URL of the hits.
- (Note by DanielLopez) Thanks to DoÄacan Güney for the tip.
+ (Note by DanielLopez) Thanks to Dogacan Güney for the tip.

  === Crawling ===

@@ -399, +399 @@
  The crawl tool expects as its first parameter the folder name where the seeding urls file is located so for example if your urls.txt is located in /nutch/seeds the crawl command would look like: crawl seed -dir /user/nutchuser...

- ==== Some pages are not indexed but my regex file and everything else is okay - what is going on? ====
+ ==== Nutch doesn't crawl relative URLs? Some pages are not indexed but my regex file and everything else is okay - what is going on? ====

  The crawl tool has a default limitation of 100 outlinks of one page that are being fetched.
- To overcome this limitation change the property to a higher value or simply -1 (unlimited).
+ To overcome this limitation change the '''db.max.outlinks.per.page''' property to a higher value or simply -1 (unlimited).

  file: conf/nutch-default.xml
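
Editor's note, not part of the diff: the last hunk names the '''db.max.outlinks.per.page''' property but the notification cuts off before showing it. A minimal sketch of what lifting the limit might look like follows. It assumes the common Nutch practice of putting local overrides in conf/nutch-site.xml instead of editing conf/nutch-default.xml; that placement is an assumption, since the diff itself only mentions conf/nutch-default.xml. The property name and the value -1 come from the changed text above.

{{{
<?xml version="1.0"?>
<!-- Assumed location: conf/nutch-site.xml (local overrides) -->
<configuration>
  <property>
    <name>db.max.outlinks.per.page</name>
    <!-- -1 means unlimited; the default mentioned above is 100 -->
    <value>-1</value>
  </property>
</configuration>
}}}

As the FAQ text notes for other property changes, a new crawl is likely needed before pages reached through the previously skipped outlinks show up in the index.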