Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "FAQ" page has been changed by LewisJohnMcgibbney: http://wiki.apache.org/nutch/FAQ?action=diff&rev1=127&rev2=128 Please visit our [[http://lucene.apache.org/nutch/bot.html|"webmaster info page"]] ==== Will Nutch be a distributed, P2P-based search engine? ==== - We don't think it is presently possible to build a peer-to-peer search engine that is competitive with existing search engines. It would just be too slow. Returning results in less than a second is important: it lets people rapidly reformulate their queries so that they can more often find what they're looking for. In short, a fast search engine is a better search engine. I don't think many people would want to use a search engine that takes ten or more seconds to return results. + We don't think it is presently possible to build a peer-to-peer search engine that is competitive with existing search engines. It would just be too slow. Returning results in less than a second is important: it lets people rapidly reformulate their queries so that they can more often find what they're looking for. In short, a fast search engine is a better search engine. We don't think many people would want to use a search engine that takes ten or more seconds to return results. That said, if someone wishes to start a sub-project of Nutch exploring distributed searching, we'd love to host it. We don't think these techniques are likely to solve the hard problems Nutch needs to solve, but we'd be happy to be proven wrong. @@ -27, +27 @@ ==== What Java version is required to run Nutch? ==== Nutch 0.7 will run with Java 1.4 and up. Nutch 1.0 with Java 6. - ==== Exception: java.net.SocketException: Invalid argument or cannot assign requested address on Fedora Core 3 or 4 ==== - It seems you have installed IPV6 on your machine. - - To solve this problem, add the following java param to the java instantiation in bin/nutch: - - JAVA_IPV4=-Djava.net.preferIPv4Stack=true - - # run it exec "$JAVA" $JAVA_HEAP_MAX $NUTCH_OPTS $JAVA_IPV4 -classpath "$CLASSPATH" $CLASS "$@" - ==== I have two XML files, nutch-default.xml and nutch-site.xml, why? ==== - nutch-default.xml is the out of the box configuration for nutch. Most configuration can (and should unless you know what your doing) stay as it is. nutch-site.xml is where you make the changes that override the default settings. The same goes to the servlet container application. + nutch-default.xml is the out of the box configuration for Nutch, and most configurations can (and should unless you know what your doing) stay as per. nutch-site.xml is where you make the changes that override the default settings. - ==== My system does not find the segments folder. Why? Or: How do I tell the ''Nutch Servlet'' where the index file are located? ==== - There are at least two choices to do that: - . First you need to copy the .WAR file to the servlet container webapps folder. - - {{{ - % cp nutch-0.7.war $CATALINA_HOME/webapps/ROOT.war - }}} - . 1) After building your first index, start Tomcat from the index folder. - . Assuming your index is located at /index : - - {{{ - % cd /index/ - % $CATATALINA_HOME/bin/startup.sh - }}} - . '''Now you can search.''' - - . 2) After building your first index, start and stop Tomcat which will make Tomcat extrat the Nutch webapp. Than you need to edit the nutch-site.xml and put in it the location of the index folder. 
- ==== My system does not find the segments folder. Why? Or: How do I tell the ''Nutch Servlet'' where the index files are located? ====
- There are at least two choices to do that:
- . First you need to copy the .WAR file to the servlet container webapps folder.
-
- {{{
- % cp nutch-0.7.war $CATALINA_HOME/webapps/ROOT.war
- }}}
- . 1) After building your first index, start Tomcat from the index folder.
- . Assuming your index is located at /index:
-
- {{{
- % cd /index/
- % $CATALINA_HOME/bin/startup.sh
- }}}
- . '''Now you can search.'''
-
- . 2) After building your first index, start and stop Tomcat, which will make Tomcat extract the Nutch webapp. Then edit nutch-site.xml and put in it the location of the index folder.
-
- {{{
- % $CATALINA_HOME/bin/startup.sh
- % $CATALINA_HOME/bin/shutdown.sh
- }}}
- {{{
- % vi $CATALINA_HOME/webapps/ROOT/WEB-INF/classes/nutch-site.xml
-
- <?xml version="1.0"?>
- <?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
-
- <nutch-conf>
-
- <property>
-   <name>searcher.dir</name>
-   <value>/your_index_folder_path</value>
- </property>
-
- </nutch-conf>
-
- % $CATALINA_HOME/bin/startup.sh
- }}}

=== Injecting ===
==== What happens if I inject urls several times? ====
URLs which are already in the database won't be injected.

=== Fetching ===
==== Is it possible to fetch only pages from some specific domains? ====
- Please have a look on PrefixURLFilter. Adding some regular expressions to the urlfilter.regex.file might work, but adding a list with thousands of regular expressions would slow down your system excessively.
+ Please have a look at PrefixURLFilter. Adding some regular expressions to the regex-urlfilter.txt file might work, but adding a list with thousands of regular expressions would slow down your system excessively.

Alternatively, you can set db.ignore.external.links to "true", and inject seeds from the domains you wish to crawl (these seeds must link, directly or indirectly, to all pages you wish to crawl). Doing this will keep the crawl within these domains without following external links. Unfortunately there is no way to record the external links encountered for future processing, although a very small patch to the generator code can allow you to log these links to hadoop.log.
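As a sketch of the db.ignore.external.links approach, the override block to add to nutch-site.xml would look like the following (same override mechanism as shown earlier; the comment summarizes the intended behaviour):

{{{
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <!-- When true, outlinks leading to a different host are ignored,
       so the crawl stays within the domains of the injected seeds. -->
</property>
}}}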
@@ -92, +45 @@

Well, you cannot. However, you have two choices to proceed:
. 1) Recover the pages already fetched and then restart the fetcher.
- . You'll need to create a file fetcher.done in the segment directory and then: [[http://wiki.apache.org/nutch/bin/nutch_updatedb|updatedb]], [[http://wiki.apache.org/nutch/bin/nutch_generate|generate]] and [[http://wiki.apache.org/nutch/bin/nutch_fetch|fetch]] . Assuming your index is at /index
+ . You'll need to create a file fetcher.done in the segment directory and then run [[http://wiki.apache.org/nutch/bin/nutch_updatedb|updatedb]], [[http://wiki.apache.org/nutch/bin/nutch_generate|generate]] and [[http://wiki.apache.org/nutch/bin/nutch_fetch|fetch]]. Assuming your crawl data is at /crawl:

{{{
% touch /crawl/segments/2005somesegment/fetcher.done
- % bin/nutch updatedb /index/db/ /index/segments/2005somesegment/
+ % bin/nutch updatedb /crawl/db/ /crawl/segments/2005somesegment/
- % bin/nutch generate /index/db/ /index/segments/2005somesegment/
+ % bin/nutch generate /crawl/db/ /crawl/segments/2005somesegment/
- % bin/nutch fetch /index/segments/2005somesegment
+ % bin/nutch fetch /crawl/segments/2005somesegment
}}}
. All the pages that were not crawled will be re-generated for fetch. If you fetched lots of pages and don't want to have to re-fetch them again, this is the best way.

@@ -121, +74 @@

* Or send the process a unix STOP signal. You should be able to index the already-fetched part of the segment, then later send a CONT signal to the process. Do not turn off your computer in between! :)

==== How many concurrent threads should I use? ====
+ This is dependent on your particular set-up; unless you understand your system and network environment, it is impossible to predict thread performance accurately. The Nutch default is an excellent starting point.
- This is dependent on your particular setup, but the following works for me:
-
- If you are using a slow internet connection (i.e. DSL), you might be limited to 40 or fewer concurrent fetches.
-
- If you have a fast internet connection (> 10Mb/sec) your bottleneck will definitely be the machine itself (in fact you will need multiple machines to saturate the data pipe). Empirically I have found that the machine works well with up to about 1000-1500 threads.
-
- To get this to work on my Linux box I needed to raise the open-file limit to 65535 (ulimit -n 65535), and I had to make sure that the DNS server could handle the load (we had to speak with our colo to get them to shut off an artificial cap on the DNS servers). Also, in order to get the speed up to a reasonable value, we needed to set the maximum fetches per host to 100 (otherwise we got a quick start followed by a very long slow tail of fetching).
-
- To other users: please add to this with your own experiences; my own experience may be atypical.

==== How can I force fetcher to use custom nutch-config? ====
* Create a new sub-directory under $NUTCH_HOME/conf, like conf/myconfig
* Copy these files from $NUTCH_HOME/conf to the new directory: common-terms.utf8, mime-types.*, nutch-conf.xsl, nutch-default.xml, regex-normalize.xml, regex-urlfilter.txt
* Modify the nutch-default.xml to suit your needs
- * Set NUTCH_CONF_DIR environment variable to point into the directory you created
+ * Set the NUTCH_CONF_DIR environment variable in $NUTCH_HOME/bin/nutch to point to the directory you created
* Run $NUTCH_HOME/bin/nutch so that it picks up the NUTCH_CONF_DIR environment variable. Check the command output for the lines where the configs are loaded, to verify that they are really loaded from your custom dir.
* Happy using.

==== bin/nutch generate generates empty fetchlist, what can I do? ====
The reason for that is that when a page is fetched, it is timestamped in the webdb, so if its time is not up it will not be included in a fetchlist. For example, if you generate a fetchlist and then delete the segment directory that was created, calling generate again will generate an empty fetchlist. So, two choices (see the sketch after this list):
+ 1) Change your system date to be 30 days from today (if you haven't changed the default settings) and re-run bin/nutch generate
+ 2) Call bin/nutch generate with -adddays 30 (if you haven't changed the default settings) to make generate think the time has come. After generate you can call bin/nutch fetch.
- . 1) Change your system date to be 30 days from today (if you haven't changed the default settings) and re-run bin/nutch generate... 2) Call bin/nutch generate with the -adddays 30 (if you haven't changed the default settings) to make generate think the time has come... After generate you can call bin/nutch fetch.
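To make option 2 concrete, here is a minimal sketch reusing the /crawl layout from the fetcher-recovery example above; the exact command syntax varies between Nutch versions, and the segment name is hypothetical (use whatever directory generate just created):

{{{
% bin/nutch generate /crawl/db/ /crawl/segments/ -adddays 30
# fetch the newly generated segment (the name below is just an example)
% bin/nutch fetch /crawl/segments/20050101120000
}}}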
- ==== While fetching I get UnknownHostException for known hosts ====
- Make sure your DNS server is working and/or can handle the load of requests.

==== How can I fetch pages that require Authentication? ====
- See [[HttpAuthenticationSchemes]].
+ See the [[HttpAuthenticationSchemes]] wiki page.

=== Updating ===
==== Isn't there redundant/wasteful duplication between nutch crawldb and solr index? ====
