Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "FAQ" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/FAQ?action=diff&rev1=128&rev2=129

  Change this line: -^(file|ftp|mailto|https): to this: -^(http|ftp|mailto|https):
  2) crawl-urlfilter.txt may have rules at the bottom to reject some URLs. If it has this fragment it's probably ok:
  . # accept anything else
  +.*
- 3) By default the [[http://www.nutch.org/docs/api/net/nutch/protocol/file/package-summary.html|"file plugin"]] is disabled. nutch-site.xml needs to be modified to allow this plugin. Add an entry like this:
+ 3) By default the protocol-file plugin is disabled. nutch-site.xml needs to be modified to allow this plugin. Add an entry like this:
  {{{
  <property>
@@ -120, +120 @@
  <value>protocol-file|...copy original values from nutch-default here...</value>
  </property>
  }}}
- where you should copy and paste all values from nutch-default.xml in the plugin.includes setting provided there. This will ensure that all plug-ins normally enabled will be enabled, plus the protocol-file plugin. Make sure to include parse-pdf if you want to parse PDF files. Make sure that urlfilter-regexp is included, or else '''the *urlfilter files will be ignored''', leading nutch to accept all URLs. You need to enable crawl URL filters to prevent nutch from crawling up the parent directory, see below.
+ where you should copy and paste all values from nutch-default.xml into the plugin.includes setting provided there. This will ensure that all plugins normally enabled will be enabled, plus the protocol-file plugin. Make sure that urlfilter-regex is included, or else '''the urlfilter files will be ignored''', leading Nutch to accept all URLs. You need to enable crawl URL filters to prevent Nutch from crawling up the parent directory; see below.
- Now you can invoke the crawler and index all or part of your disk. The only remaining gotcha is that if you use Mozilla it will '''not''' load file: URLs from a web page fetched with http, so if you test with the Nutch web container running in Tomcat, annoyingly, nothing will happen as you click on results, because Mozilla by default does not load file: URLs. This is mentioned [[http://www.mozilla.org/quality/networking/testing/filetests.html|here]] and this behavior may be disabled by a [[http://www.mozilla.org/quality/networking/docs/netprefs.html|preference]] (see security.checkloaduri). IE5 does not have this problem.
+ Now you can invoke the crawler and index all or part of your disk.

==== Nutch crawling parent directories for file protocol ====
- If you find nutch crawling parent directories when using the file protocol, the following kludge may help:
+ If you find Nutch crawling parent directories when using the file protocol, the following Jira issue may help: http://issues.apache.org/jira/browse/NUTCH-407
  E.g. for urlfilter-regex you could put the following in regex-urlfilter.txt:
@@ -133, +133 @@
  +^file:///c:/top/directory/
  -.
  }}}
- Alternatively, you could apply the patch described [[http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch|on this page]], which would avoid the hardwiring of the site-specific /top/directory in your configuration file.
+ Alternatively, you could apply the patch described [[http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch|on this page]], which would avoid the hard-wiring of the site-specific /top/directory in your configuration file.
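As a concrete illustration of the plugin.includes value referred to in 3) above, a fully spelled-out entry might look like the sketch below. The plugin list assumes the stock defaults of a 0.9-era install; copy the actual list from your own nutch-default.xml rather than reusing this one verbatim.

{{{
<property>
  <name>plugin.includes</name>
  <!-- Sketch only: assumed 0.9-era defaults, with protocol-http swapped
       for protocol-file and parse-pdf added. Verify every plugin name
       against the plugin.includes entry in your nutch-default.xml. -->
  <value>protocol-file|urlfilter-regex|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
}}}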
==== How do I index remote file shares? ====
At the current time, Nutch does not have built-in support for accessing files over SMB (Windows) shares. This means the only available method is to mount the shares yourself, then index the contents as though they were local directories (see above). Note that the share-mounting method suffers from the following drawbacks:
- . 1) The links generated by Nutch will not work except for queries from localhost (end users typically won't have the exact same shares mounted in the exact same way). 2) You are limited to the number of mounted shares your operating system supports. In *nix environments, this is effectively unlimited, but in Windows you may mount 26 (one share or drive per letter in the English alphabet) 3) Documents with links to shares are unlikely to work since they won't link to the share on your machine, but rather to the SMB version.
+ 1) The links generated by Nutch will not work except for queries from localhost (end users typically won't have the exact same shares mounted in the exact same way)
+ 2) You are limited to the number of mounted shares your operating system supports. In *nix environments, this is effectively unlimited, but in Windows you may mount 26 (one share or drive per letter in the English alphabet)
+ 3) Documents with links to shares are unlikely to work since they won't link to the share on your machine, but rather to the SMB version.

- ==== While indexing documents, I get the following error: ====
- ''050529 011245 fetch okay, but can't parse myfile, reason: Content truncated at 65536 bytes. Parser can't handle incomplete msword file.''
-
- '''What is happening?'''
-
- By default, the size of the documents downloaded by Nutch is limited (to 65536 bytes). To allow Nutch to download larger files (via HTTP), modify nutch-site.xml and add an entry like this:
- {{{
- <property>
-   <name>http.content.limit</name>
-   <value>150000</value>
- </property>
- }}}
- If you do not want to limit the size of downloaded documents, set http.content.limit to a negative value:
- {{{
- <property>
-   <name>http.content.limit</name>
-   <value>-1</value>
- </property>
- }}}

=== Segment Handling ===

==== Do I have to delete old segments after some time? ====
If you're fetching regularly, segments older than the db.default.fetch.interval can be deleted, as their pages should have been refetched by then. This interval is 30 days by default.

=== MapReduce ===

==== What is MapReduce? ====
- MapReduce
+ Please see the MapReduce page of the Nutch wiki.

==== How to start working with MapReduce? ====
- . edit conf/nutch-site.xml
+ . edit $HADOOP_HOME/conf/mapred-site.xml
  <property>
    <name>fs.default.name</name>
    <value>localhost:9000</value>
    <description>The name of the default file system. Either the literal string "local" or a host:port for NDFS.</description>
  </property>
@@ -241, +222 @@
  Deleted /user/root/crawl-20050927144626
  }}}

+ === Scoring ===
+ ==== How can I influence Nutch scoring? ====
+ Scoring is implemented as a filter plugin, i.e. an implementation of the !ScoringFilter class. By default, the OPIC scoring filter is used. There are also numerous scoring-filter properties which can be specified within nutch-site.xml.
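For instance, the score contributions used by the OPIC filter can be tuned with entries like the sketch below. The property names appear in stock nutch-default.xml, but double-check the names and default values for your version:

{{{
<!-- Sketch: tune how much score newly injected pages start with and how
     much score is passed along internal/external links. Values shown are
     the assumed stock defaults; verify in your nutch-default.xml. -->
<property>
  <name>db.score.injected</name>
  <value>1.0</value>
</property>
<property>
  <name>db.score.link.internal</name>
  <value>1.0</value>
</property>
<property>
  <name>db.score.link.external</name>
  <value>1.0</value>
</property>
}}}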
=== Searching ===
- ==== Common words are saturating my search results. ====
- You can tweak your conf/common-terms.utf8 file after creating an index through the following command:
- . bin/nutch org.apache.nutch.indexer.HighFreqTerms -count 10 -nofreqs index
-
- ==== How is scoring done in Nutch? (Or, explain the "explain" page?) ====
- Nutch is built on Lucene. To understand Nutch scoring, study how Lucene does it. The formula Lucene uses for scoring can be found at the head of the Lucene Similarity class in the [[http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html|Lucene Similarity Javadoc]]. Roughly, the score for a particular document in a set of query results, "score(q,d)", is the sum of the score for each term of a query ("t in q"). A term's score in a document is itself the sum of the term run against each field that comprises a document ("title" is one field, "url" another; a "document" is a set of "fields"). Per field, the score is the product of the following factors: its "tf" (term frequency in the document), a score factor "idf" (usually made up of the frequency of the term relative to the number of docs in the index), an index-time boost, a normalization of the count of terms found relative to the size of the document ("lengthNorm"), a similar normalization for the term in the query itself ("queryNorm"), and finally a factor that rewards documents containing more of the query's terms. Study the Lucene javadoc to get more detail on each of the equation components and how they affect the overall score.
-
- Interpreting the Nutch "explain.jsp", you need to keep the above-cited Lucene scoring equation in mind. First, notice how we move right as we move from "score total", to "score per query term", to "score per query document field" (a document field is not shown if a term was not found in a particular field). Next, studying a particular field's scoring, it comprises a query component and then a field component. The query component includes the query-time -- as opposed to index-time -- boost, an "idf" that is the same for the query and field components, and then a "queryNorm". Similar for the field component ("fieldNorm" is an aggregation of certain of the Lucene equation components).
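Written out as a formula, the explanation above corresponds to Lucene's documented practical scoring function (as given in the Lucene 2.4 Similarity Javadoc); this is Lucene's formula, not anything Nutch-specific:

{{{
score(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot
             \sum_{t \in q} \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^{2}
             \cdot \mathrm{boost}(t) \cdot \mathrm{norm}(t,d)
}}}

Here boost(t) is the query-time boost for term t, norm(t,d) folds together the index-time boost and lengthNorm, and coord(q,d) is the factor that rewards documents matching more of the query's terms.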
- ==== How can I influence Nutch scoring? ====
- Scoring is implemented as a filter plugin, i.e. an implementation of the !ScoringFilter class. By default, [[http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/scoring/opic/OPICScoringFilter.html|OPICScoringFilter]] is used.
-
- However, the easiest way to influence scoring is to change the query-time boosts (this requires an edit of nutch-site.xml and a redeploy of the WAR file). The query-time boosts by default look like this:
-
- {{{
- query.url.boost, 4.0f
- query.anchor.boost, 2.0f
- query.title.boost, 1.5f
- query.host.boost, 2.0f
- query.phrase.boost, 1.0f
- }}}
- From the list above, you can see that terms found in a document's URL get the highest boost, with anchor text next, etc.
-
- Anchor text makes a large contribution to document score (you can see the anchor text for a page by browsing to "explain" and then editing the URL to put "anchors.jsp" in place of "explain.jsp").
-
- ==== What is the RSS symbol in search results all about? ====
- Clicking on the RSS symbol sends the current query back to Nutch, to a servlet named [[http://lucene.apache.org/nutch/apidocs/org/apache/nutch/searcher/OpenSearchServlet.html|OpenSearchServlet]]. OpenSearchServlet reruns the query and returns the results formatted instead as RSS (XML). The RSS format is based on [[http://a9.com/-/spec/opensearchrss/1.0/|OpenSearch RSS 1.0]] from [[http://www.a9.com|a9.com]]: "OpenSearch RSS 1.0 is an extension to the RSS 2.0 standard, conforming to the guidelines for RSS extensibility as outlined by the RSS 2.0 specification" (see also [[http://opensearch.a9.com/|opensearch]]). Nutch in turn makes extensions to OpenSearch. The Nutch extensions are identified by the 'nutch' namespace prefix and add to OpenSearch navigation information, the original query, and all fields that are available at search-result time, including the Nutch page boost, the name of the segment the page resides in, etc.
-
- Results as RSS (XML) rather than HTML are easier for programmatic clients to parse: such clients will query against OpenSearchServlet rather than search.jsp. Results as XML can also be transformed using XSL stylesheets, the likely direction of UI development going forward.

==== How can I find out/display the size and mime type of the hits that a search returns? ====
In order to be able to find this information you have to modify the standard {{{plugin.includes}}} property of the Nutch configuration file and add the {{{index-more}}} filter.
{{{
<property>
  <name>plugin.includes</name>
- <value>...|index-more|...|query-more|...</value>
+ <value>...|index-more|...|...</value>
  ...
</property>
}}}
After that, __don't forget to crawl again__, and you should be able to retrieve the mime type and content length through the class HitDetails (via the fields "primaryType", "subType" and "contentLength"), just as you normally do for the title and URL of the hits.
- . (Note by DanielLopez) Thanks to Dogacan Güney for the tip.

- === Crawling ===
- ==== java.io.IOException: No input directories specified in: NutchConf: nutch-default.xml, mapred-default.xml ====
- The crawl tool expects as its first parameter the name of the folder where the seeding urls file is located, so for example if your urls.txt is located in /nutch/seeds the crawl command would look like: crawl seed -dir /user/nutchuser...

==== Nutch doesn't crawl relative URLs? Some pages are not indexed but my regex file and everything else is okay - what is going on? ====
The crawl tool by default fetches at most 100 outlinks of any one page. To overcome this limitation, change the '''db.max.outlinks.per.page''' property to a higher value, or simply -1 (unlimited).
@@ -306, +256 @@
  </description>
  </property>
  }}}
- see also: http://www.mail-archive.com/[email protected]/msg08665.html (tested under nutch 0.9)
+ see also: http://www.mail-archive.com/[email protected]/msg08665.html
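The hunk above shows only the tail of the property block being discussed; in full, the nutch-site.xml entry would look roughly like the sketch below (the wording of the description is paraphrased, so check nutch-default.xml for the authoritative text):

{{{
<property>
  <name>db.max.outlinks.per.page</name>
  <!-- Sketch: -1 means process all outlinks; a nonnegative value caps
       the number of outlinks kept per page (assumed default: 100). -->
  <value>-1</value>
</property>
}}}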
=== Discussion ===
[[http://grub.org/|Grub]] has some interesting ideas about building a search engine using distributed computing. ''And how is that relevant to Nutch?''
