Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by Sergey:
http://wiki.apache.org/nutch/FAQ

------------------------------------------------------------------------------
==== What happens if I inject urls several times? ====
URLs that are already in the database won't be injected.
-
- ==== Java.io.IOException: No input directories specified in: NutchConf: nutch-default.xml , mapred-default.xml ====
-
- This really is a crawl tool issue, but is covered here as well: The crawl tool expects as its first parameter the name of the folder that contains the seed URL file. For example, if your urls.txt is located in /nutch/seeds, the crawl command would look like: crawl seeds -dir /user/nutchuser...
=== Fetching ===
@@ -166, +162 @@
==== While fetching I get UnknownHostException for known hosts ====
Make sure your DNS server is working and can handle the load of requests.
-
-
- ==== It seems as if not all links are followed in the pages in my URL lists ====
-
- 1.) Make sure that your expressions in conf/crawl-urlfilter.txt are correct; the links may be dropped there.
-
- 2.) Make sure that in conf/nutch-site.xml the following parameters are set appropriately:
-
-  * http.content.limit: otherwise some content may never be fetched at all
-  * db.max.outlinks.per.page: otherwise the links might be dropped.
-
- 3.) Make sure you have the parse-js and all other necessary plugins active in conf/nutch-site.xml
=== Updating ===
@@ -351, +335 @@
==== How is scoring done in Nutch? (Or, explain the "explain" page?) ====
- Nutch is built on Lucene. To understand Nutch scoring, study how Lucene does it. The formula Lucene uses for scoring can be found at the head of the Lucene Similarity class in the [http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html Lucene Similarity Javadoc]. Lucene scoring appears to be based on the Vector Space Model of information retrieval.
Roughly, the score for a particular document in a set of query results, "score(q,d)", is the sum of the score for each term of a query ("t in q"). A term's score in a document is itself the sum of the term's score against each field that comprises a document ("title" is one field, "url" another; a "document" is a set of "fields"). Per field, the score is the product of the following factors: its "tf" (term frequency in the document), a score factor "idf" (a factor made up of the frequency of the term relative to the number of docs in the index), an index-time boost, a normalization of the count of terms found relative to the size of the document ("lengthNorm"), a similar normalization for the term in the query itself ("queryNorm"), and finally, a factor that weights how many of the query's terms a particular document contains. Study the Lucene Javadoc to get more detail on each of the equation components and how they affect the overall score.
+ Nutch is built on Lucene. To understand Nutch scoring, study how Lucene does it. The formula Lucene uses for scoring can be found at the head of the Lucene Similarity class in the [http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html Lucene Similarity Javadoc]. Roughly, the score for a particular document in a set of query results, "score(q,d)", is the sum of the score for each term of a query ("t in q"). A term's score in a document is itself the sum of the term's score against each field that comprises a document ("title" is one field, "url" another; a "document" is a set of "fields").
Per field, the score is the product of the following factors: its "tf" (term frequency in the document), a score factor "idf" (usually a factor made up of the frequency of the term relative to the number of docs in the index), an index-time boost, a normalization of the count of terms found relative to the size of the document ("lengthNorm"), a similar normalization for the term in the query itself ("queryNorm"), and finally, a factor that weights how many of the query's terms a particular document contains. Study the Lucene Javadoc to get more detail on each of the equation components and how they affect the overall score.
To interpret the Nutch "explain.jsp" output, keep the Lucene scoring equation cited above in mind. First, notice how we move right as we move from "score total", to "score per query term", to "score per query document field" (a document field is not shown if a term was not found in that field). Next, the score for a particular field comprises a query component and a field component. The query component includes a query-time (as opposed to index-time) boost, an "idf" that is the same for the query and field components, and a "queryNorm". The field component is similar ("fieldNorm" is an aggregation of certain of the Lucene equation components).
@@ -373, +357 @@
Results as RSS (XML) rather than HTML are easier for programmatic clients to parse: such clients will query against [http://lucene.apache.org/nutch/apidocs/org/apache/nutch/searcher/OpenSearchServlet.html OpenSearchServlet] rather than search.jsp. Results as XML can also be transformed using XSL stylesheets, the likely direction of UI development going forward.
- ==== How can I enable Porter Stemming in Nutch? ====
- Stemming in Nutch can be implemented in the following ways. However, you will have to modify some internal files.
- Thanks to Howie Wang for developing the code for version 0.7.2. I updated his code for 0.8 and it should work without any issues.
-
- [[http://wiki.apache.org/nutch/Stemming#head-3d04863d929228d17d0a8402a32c3900f7b13be7 Stemming for Version 0.7.2]]
-
- [[http://wiki.apache.org/nutch/Stemming#head-d314c23de10878dc34d76337632908d9f940e907 Stemming for Version 0.8]]
-
=== Crawling ===
==== Java.io.IOException: No input directories specified in: NutchConf: nutch-default.xml , mapred-default.xml ====
- The crawl tool expects as its first parameter the name of the folder that contains the seed URL file. For example, if your urls.txt is located in /nutch/seeds, the crawl command would look like: crawl seeds -dir /user/nutchuser...
+ The crawl tool expects as its first parameter the name of the folder that contains the seed URL file. For example, if your urls.txt is located in /nutch/seeds, the crawl command would look like: crawl seed -dir /user/nutchuser...
- === ReCrawling ===
- Here are scripts to help you with intranet recrawling.
- ==== Version 0.7.2 ====
- Place in your main Nutch directory.
-
- [http://wiki.apache.org/nutch/IntranetRecrawl#head-76dc88d48baa3f429f58e9827ff2debba1f66098 Recrawl-0.7.2]
- ==== Version 0.8.0 ====
- Place in the bin sub-directory of Nutch.
-
- [http://wiki.apache.org/nutch/IntranetRecrawl#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03 Recrawl-0.8.0]
=== Discussion ===
[http://grub.org/ Grub] has some interesting ideas about building a search engine using distributed computing. ''And how is that relevant to nutch?''

_______________________________________________
Nutch-cvs mailing list
Nutch-cvs@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-cvs
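
The scoring description in the FAQ text above (a sum over query terms, then over fields, of a product of factors) can be illustrated with a small Python sketch. This is not Nutch or Lucene code. The function names are hypothetical, and the concrete formulas chosen for tf, idf and lengthNorm are assumptions modelled on Lucene's default similarity; the Lucene Similarity Javadoc remains the authoritative source. The query-side boost and the exact queryNorm computation are deliberately left out.

```python
import math


def tf(freq):
    # Assumed term-frequency factor: square root of the raw count.
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    # Assumed inverse document frequency: rarer terms score higher.
    return math.log(num_docs / (doc_freq + 1)) + 1.0


def length_norm(num_tokens):
    # Assumed length normalization: shorter fields score higher.
    return 1.0 / math.sqrt(num_tokens)


def score(query_terms, doc_fields, doc_freq, num_docs,
          field_boost=None, query_norm=1.0):
    """Toy score(q,d): sum over terms and fields of a product of factors.

    query_terms: list of query terms
    doc_fields:  dict mapping field name -> list of tokens in that field
    doc_freq:    dict mapping term -> number of docs containing it
    field_boost: optional dict mapping field name -> index-time boost
    """
    field_boost = field_boost or {}
    total = 0.0
    matched_terms = 0
    for term in query_terms:
        found = False
        for field, tokens in doc_fields.items():
            freq = tokens.count(term)
            if freq == 0:
                continue  # a field without the term contributes nothing
            found = True
            total += (tf(freq)
                      * idf(doc_freq.get(term, 0), num_docs)
                      * field_boost.get(field, 1.0)
                      * length_norm(len(tokens))
                      * query_norm)
        if found:
            matched_terms += 1
    # Final factor: fraction of the query's terms found in the document.
    coord = matched_terms / len(query_terms) if query_terms else 0.0
    return total * coord


doc = {"title": ["nutch", "faq"], "url": ["nutch"]}
print(score(["nutch", "lucene"], doc, {"nutch": 1}, num_docs=2))
```

In this toy example the "url" field contributes more than "title" for the same term because it is shorter, and the total is halved because only one of the two query terms appears in the document.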