Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by Gal Nitzan: http://wiki.apache.org/nutch/FAQ

------------------------------------------------------------------------------

This is the official Nutch FAQ.

[[TableOfContents]]

== Nutch FAQ ==

=== General ===

==== Are there any mailing lists available? ====

There are user, developer, commits and agents lists, all available at http://lucene.apache.org/nutch/mailing_lists.html#Agents .

==== My system does not find the segments folder. Why? OR How do I tell the ''Nutch Servlet'' where the index files are located? ====

There are at least two ways to do that. In either case, first copy the .WAR file to the servlet container webapps folder:

 % cp nutch-0.7.war $CATALINA_HOME/webapps/ROOT.war

1) After building your first index, start Tomcat from the index folder. Assuming your index is located at /index/db/:

 % cd /index/db/
 % $CATALINA_HOME/bin/startup.sh

2) Point the webapp at the index explicitly. Start and stop Tomcat once, so that it extracts the contents of the ROOT.war file:

 % $CATALINA_HOME/bin/startup.sh
 % $CATALINA_HOME/bin/shutdown.sh

Then edit the nutch-default.xml located at $CATALINA_HOME/webapps/ROOT/WEB-INF/classes/, look for the entry searcher.dir and replace its value with your index location /index/db.

=== Injecting ===

==== What happens if I inject urls several times? ====

Urls which are already in the database won't be injected.

=== Fetching ===

==== Is it possible to fetch only pages from some specific domains? ====

Please have a look at PrefixURLFilter. Adding some regular expressions to the urlfilter.regex.file might work, but adding a list with thousands of regular expressions would slow down your system excessively.

==== How can I recover an aborted fetch process? ====

You have two choices:

1) Use the aborted output. You'll need to touch the file fetcher.done in the segment directory; all the pages that were not crawled will be re-generated for fetch pretty soon. If you fetched lots of pages and don't want to re-fetch them again, this is the best way.
2) Discard the aborted output. To do this, just delete the fetcher* directories in the segment and restart the fetcher.
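For example, assuming the aborted segment lives at /index/segments/20051004120000 (the segment name here is only illustrative), the two choices look like this:

 # choice 1: keep the fetched pages and mark the fetch as finished
 % touch /index/segments/20051004120000/fetcher.done

 # choice 2: discard the partial output and fetch the segment again
 % rm -rf /index/segments/20051004120000/fetcher*
 % bin/nutch fetch /index/segments/20051004120000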
==== Who changes the next fetch date? ====

 * After injecting a new url the next fetch date is set to the current time.
 * Generating a fetchlist advances the date by 7 days.
 * Updating the db sets the date to the current time + db.default.fetch.interval - 7 days.

==== I have a big fetchlist in my segments folder. How can I fetch only some sites at a time? ====

 * You have to decide how many pages you want to crawl before generating segments, and use the options of bin/nutch generate.
 * Use -topN to limit the total number of pages.
 * Use -numFetchers to generate multiple small segments.
 * Afterwards you can generate new segments. You may want to use -adddays so that bin/nutch generate puts all the urls into the new fetchlist again; add more than 7 days if you did not run an updatedb.
 * Alternatively, send the fetcher process a unix STOP signal. You should be able to index the part of the segment that is already fetched, and later send a CONT signal to the process. Do not turn off your computer in between! :)

==== How can I force fetcher to use custom nutch-config? ====

 * Create a new sub-directory under $NUTCH_HOME/conf, like conf/myconfig
 * Copy these files from $NUTCH_HOME/conf to the new directory: common-terms.utf8, mime-types.*, nutch-conf.xsl, nutch-default.xml, regex-normalize.xml, regex-urlfilter.txt
 * Modify the nutch-default.xml to suit your needs
 * Set the NUTCH_CONF_DIR environment variable to point at your new directory
 * Run $NUTCH_HOME/bin/nutch so that it gets the NUTCH_CONF_DIR environment variable. Check the command output for the lines where the configs are loaded, to verify they are really loaded from your custom dir.
 * Happy using.

=== Updating ===

=== Indexing ===

==== Is it possible to change the list of common words without crawling everything again? ====

==== How do I index my local file system? ====

The tricky thing about Nutch is that out of the box it has most plugins disabled and is tuned for a crawl of a "remote" web server - you '''have''' to change config files to get it to crawl your local disk.

1) crawl-urlfilter.txt needs a change to allow file: URLs while not following http: ones, otherwise it either won't index anything, or it'll jump off your disk onto web sites. Change this line:

 -^(file|ftp|mailto|https):

to this:

 -^(http|ftp|mailto|https):

2) crawl-urlfilter.txt may have rules at the bottom to reject some URLs. If it has this fragment it's probably ok:

 # accept anything else
 +.*
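After steps 1 and 2, the bottom of conf/crawl-urlfilter.txt for a local-disk crawl should look something like this (the comment lines are only illustrative):

 # skip http:, ftp:, mailto: and https: urls, so the crawl stays on the local disk
 -^(http|ftp|mailto|https):

 # accept anything else
 +.*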
3) By default the [http://www.nutch.org/docs/api/net/nutch/protocol/file/package-summary.html "file plugin"] is disabled. nutch-site.xml needs to be modified to allow this plugin. Add an entry like this:

 <property>
   <name>plugin.includes</name>
   <value>protocol-file|protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)</value>
 </property>

Now you can invoke the crawler and index all or part of your disk. The only remaining gotcha is that if you use Mozilla it will '''not''' load file: URLs from a web page fetched over http. So if you test with the Nutch web container running in Tomcat, annoyingly, nothing will happen as you click on results, because Mozilla by default does not load file: URLs. This is mentioned [http://www.mozilla.org/quality/networking/testing/filetests.html here], and the behavior can be disabled by a [http://www.mozilla.org/quality/networking/docs/netprefs.html preference] (see security.checkloaduri). IE5 does not have this problem.

==== While indexing documents, I get the following error: ====

''050529 011245 fetch okay, but can't parse myfile, reason: Content truncated at 65536 bytes. Parser can't handle incomplete msword file.''

'''What is happening?'''

By default, the size of the documents downloaded by Nutch is limited to 65536 bytes. To allow Nutch to download larger files (via HTTP), modify nutch-site.xml and add an entry like this:

 <property>
   <name>http.content.limit</name>
   <value>150000</value>
 </property>

If you do not want to limit the size of downloaded documents, set http.content.limit to a negative value.
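For example, a minimal nutch-site.xml entry that removes the limit entirely (any negative value will do):

 <property>
   <name>http.content.limit</name>
   <value>-1</value>
 </property>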
=== Segment Handling ===

==== Do I have to delete old segments after some time? ====

If you're fetching regularly, segments older than db.default.fetch.interval can be deleted, as their pages should have been refetched by then. This is 30 days by default.

=== Searching ===

==== Common words are saturating my search results. ====

You can tweak your conf/common-terms.utf8 file after creating an index through the following command: