Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by Gal Nitzan:
http://wiki.apache.org/nutch/FAQ

------------------------------------------------------------------------------
  Adding some regular expressions to the urlfilter.regex.file might work, but adding a list with thousands of regular expressions would slow down your system excessively.

  ==== How can I recover an aborted fetch process? ====
+ You have two choices:
+ 1) Use the aborted output. You'll need to touch the file fetcher.done in the segment directory. All the pages that were not crawled will be re-generated for fetch pretty soon. If you fetched lots of pages and don't want to re-fetch them, this is the best way.
+ 2) Discard the aborted output. To do this, just delete the fetcher* directories in the segment and restart the fetcher.

@@ -74, +75 @@
   * Or send the process a Unix STOP signal. You should be able to index the already-fetched part of the segment. Then later send a CONT signal to the process. Do not turn off your computer in between! :)

  ==== How can I force the fetcher to use a custom nutch-config? ====
+  * Create a new sub-directory under $NUTCH_HOME/conf, like conf/myconfig
+  * Copy these files from $NUTCH_HOME/conf to the new directory: common-terms.utf8, mime-types.*, nutch-conf.xsl, nutch-default.xml, regex-normalize.xml, regex-urlfilter.txt
+  * Modify the nutch-default.xml to suit your needs

@@ -113, +115 @@
  # accept anything else
  +.*

- 3. By default the [http://www.nutch.org/docs/api/net/nutch/protocol/file/package-summary.html "file plugin"] is disabled. nutch-site.xml needs to be modified to allow this plugin. Add an entry like this:
+ 3) By default the [http://www.nutch.org/docs/api/net/nutch/protocol/file/package-summary.html "file plugin"] is disabled. nutch-site.xml needs to be modified to allow this plugin.
  Add an entry like this:
  <property>
  <name>plugin.includes</name>

@@ -129, +131 @@
  '''What is happening?''' By default, the size of the documents downloaded by Nutch is limited (to 65536 bytes). To allow Nutch to download larger files (via HTTP), modify nutch-site.xml and add an entry like this:
+ <property>
+ <name>http.content.limit</name>
+ <value>'''150000'''</value>
+ </property>
- <property>
- <name>http.content.limit</name>
- <value>====150000====</value>
- </property>
-
- If you do not want to limit the size of downloaded documents, set http.content.limit to a negative value.
+ If you do not want to limit the size of downloaded documents, set http.content.limit to a negative value:
+ <property>
+ <name>http.content.limit</name>
+ <value>'''-1'''</value>
+ </property>

  === Segment Handling ===

  ==== Do I have to delete old segments after some time? ====
+
- * If you're fetching regularly, segments older than the db.default.fetch.interval can be deleted, as their pages should have been refetched. This is 30 days by default.
+ If you're fetching regularly, segments older than the db.default.fetch.interval can be deleted, as their pages should have been refetched. This is 30 days by default.

  === Searching ===

  ==== Common words are saturating my search results. ====
- * You can tweak your conf/common-terms.utf8 file after creating an index through the following command:
+ You can tweak your conf/common-terms.utf8 file after creating an index through the following command:
+ bin/nutch org.apache.nutch.indexer.HighFreqTerms -count 10 -nofreqs index
- bin/nutch org.apache.nutch.indexer.HighFreqTerms -count 10 -nofreqs index

+ ==== What ranking algorithm is used in searches? Does Nutch use the [http://en.wikipedia.org/wiki/HITS_algorithm HITS algorithm]? ====
+ N/A yet
-
- Q: What ranking algorithm is used in searches? Does Nutch use the [http://en.wikipedia.org/wiki/HITS_algorithm HITS algorithm]?

  === Crawling ===
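The two choices in the "How can I recover an aborted fetch process?" entry above come down to a couple of filesystem operations. A minimal shell sketch, using mock segment directories (the names seg_keep and seg_discard are illustrative stand-ins; in a real crawl these would be timestamped segment directories created by the fetcher):

```shell
# Mock segments for illustration only; real segments live under the
# crawl's segments directory and contain fetcher output subdirectories.
mkdir -p seg_keep/fetcher
mkdir -p seg_discard/fetcher seg_discard/fetcher_text

# Choice 1: keep the partial output. Touching fetcher.done marks the
# segment as fetched; pages that were never fetched will be
# re-generated for fetch later.
touch seg_keep/fetcher.done

# Choice 2: discard the partial output by deleting the fetcher*
# directories, then restart the fetcher on this segment.
rm -rf seg_discard/fetcher*
```

Which choice is cheaper depends on how much was already fetched: choice 1 preserves the downloaded pages, while choice 2 throws them away and starts the segment's fetch over.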
