Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by Gal Nitzan:
http://wiki.apache.org/nutch/FAQ

------------------------------------------------------------------------------
  Adding some regular expressions to the urlfilter.regex.file might work, but adding a list with thousands of regular expressions would slow down your system excessively.

  ==== How can I recover an aborted fetch process? ====
+ You have two choices:
+ 1) Use the aborted output. You'll need to touch the file fetcher.done in the segment directory. All the pages that were not crawled will be re-generated for fetch pretty soon. If you fetched lots of pages and don't want to re-fetch them, this is the best way.
+ 2) Discard the aborted output. To do this, just delete the fetcher* directories in the segment and restart the fetcher.

@@ -74, +75 @@
   * Or send the process a Unix STOP signal. You should be able to index the already-fetched part of the segment. Then later send a CONT signal to the process. Do not turn off your computer in between! :)

  ==== How can I force the fetcher to use a custom nutch-config? ====
+  * Create a new sub-directory under $NUTCH_HOME/conf, like conf/myconfig
+  * Copy these files from $NUTCH_HOME/conf to the new directory: common-terms.utf8, mime-types.*, nutch-conf.xsl, nutch-default.xml, regex-normalize.xml, regex-urlfilter.txt
+  * Modify the nutch-default.xml to suit your needs

@@ -113, +115 @@
  # accept anything else
  +.*

- 3. By default the [http://www.nutch.org/docs/api/net/nutch/protocol/file/package-summary.html "file plugin"] is disabled. nutch-site.xml needs to be modified to allow this plugin. Add an entry like this:
+ 3) By default the [http://www.nutch.org/docs/api/net/nutch/protocol/file/package-summary.html "file plugin"] is disabled. nutch-site.xml needs to be modified to allow this plugin.
  Add an entry like this:
  <property>
  <name>plugin.includes</name>

@@ -129, +131 @@
  '''What is happening?''' By default, the size of the documents downloaded by Nutch is limited (to 65536 bytes). To allow Nutch to download larger files (via HTTP), modify nutch-site.xml and add an entry like this:
+ <property>
+ <name>http.content.limit</name>
+ <value>'''150000'''</value>
+ </property>
- <property>
- <name>http.content.limit</name>
- <value>====150000====</value>
- </property>
-
- If you do not want to limit the size of downloaded documents, set http.content.limit to a negative value.
+ If you do not want to limit the size of downloaded documents, set http.content.limit to a negative value:
+ <property>
+ <name>http.content.limit</name>
+ <value>'''-1'''</value>
+ </property>

  === Segment Handling ===

  ==== Do I have to delete old segments after some time? ====
+
- * If you're fetching regularly, segments older than the db.default.fetch.interval can be deleted, as their pages should have been refetched. This is 30 days by default.
+ If you're fetching regularly, segments older than the db.default.fetch.interval can be deleted, as their pages should have been refetched. This is 30 days by default.

  === Searching ===

  ==== Common words are saturating my search results. ====
- * You can tweak your conf/common-terms.utf8 file after creating an index through the following command:
+ You can tweak your conf/common-terms.utf8 file after creating an index through the following command:
+ bin/nutch org.apache.nutch.indexer.HighFreqTerms -count 10 -nofreqs index
- bin/nutch org.apache.nutch.indexer.HighFreqTerms -count 10 -nofreqs index

+ ==== What ranking algorithm is used in searches? Does Nutch use the [http://en.wikipedia.org/wiki/HITS_algorithm HITS algorithm]? ====
+ N/A yet
-
- Q: What ranking algorithm is used in searches? Does Nutch use the [http://en.wikipedia.org/wiki/HITS_algorithm HITS algorithm]?

  === Crawling ===
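The two choices in the "How can I recover an aborted fetch process?" entry above come down to a couple of filesystem operations. A minimal shell sketch, using mock segment directories (the names seg_keep and seg_discard are illustrative stand-ins; in a real crawl these would be timestamped segment directories created by the fetcher):

```shell
# Mock segments for illustration only; real segments live under the
# crawl's segments directory and contain fetcher output subdirectories.
mkdir -p seg_keep/fetcher
mkdir -p seg_discard/fetcher seg_discard/fetcher_text

# Choice 1: keep the partial output. Touching fetcher.done marks the
# segment as fetched; pages that were never fetched will be
# re-generated for fetch later.
touch seg_keep/fetcher.done

# Choice 2: discard the partial output by deleting the fetcher*
# directories, then restart the fetcher on this segment.
rm -rf seg_discard/fetcher*
```

Which choice is cheaper depends on how much was already fetched: choice 1 preserves the downloaded pages, while choice 2 throws them away and starts the segment's fetch over.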
