Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by KaiMiddleton:
http://wiki.apache.org/nutch/FAQ

------------------------------------------------------------------------------
      Assuming your index is located at /index :
      {{{% cd /index/
  % $CATALINA_HOME/bin/startup.sh}}}
-     '''Now you can search.''
+     '''Now you can search.'''
  
    2) After building your first index, start and stop Tomcat, which will make 
Tomcat extract the Nutch webapp. Then you need to edit nutch-site.xml and 
put in it the location of the index folder.
      {{{% $CATALINA_HOME/bin/startup.sh
@@ -391, +391 @@

  </property>
  }}}
  After that, __don't forget to crawl again__ and you should be able to 
retrieve the mime-type and content-length through the class HitDetails (via the 
fields "primaryType", "subType" and "contentLength") as you normally do for the 
title and URL of the hits.
-       (Note by DanielLopez) Thanks to Doğacan Güney for the tip.
+       (Note by DanielLopez) Thanks to Dogacan Güney for the tip.
  
  === Crawling ===
  
@@ -399, +399 @@

  
  The crawl tool expects as its first parameter the name of the folder where 
the seed URLs file is located. For example, if your urls.txt is located in 
/nutch/seeds, the crawl command would look like: crawl seed -dir 
/user/nutchuser...
  
- ==== Some pages are not indexed but my regex file and everything else is okay 
- what is going on? ====
+ ==== Nutch doesn't crawl relative URLs? Some pages are not indexed but my 
regex file and everything else is okay - what is going on? ====
  By default, the crawl tool is limited to 100 outlinks per page; outlinks 
beyond that limit are not fetched.
- To overcome this limitation change the property to a higher value or simply 
-1 (unlimited).
+ To overcome this limitation change the '''db.max.outlinks.per.page''' 
property to a higher value or simply -1 (unlimited).
  
  file: conf/nutch-default.xml
  
