Re: Nutch ML cleanup
ogjunk-nu...@yahoo.com is a member of nutch-...@lists.sourceforge.net and nutch-gene...@lists.sourceforge.net. These lists do not otherwise appear to forward to Apache lists. They used to perhaps forward through nutch.org lists, but that domain no longer forwards any email. Please check the message headers to see how this message is routed to you. If it is indeed routed through Apache servers then please send the headers to me. Doug Andrzej Bialecki wrote: Otis Gospodnetic wrote: Hi, This has been bugging me for a while now. For some reason Nutch MLs get the most junk emails - both rude/rudeish emails, as well as clear spam (with SPAM in the subject - something must be detecting it). I just looked at the headers of the clearly labeled spam messages and found that they all seem to come from SF: To: nutch-...@lists.sourceforge.net To: nutch-gene...@lists.sourceforge.net I assume there is some kind of a mail forward from the old Nutch MLs on SF to the new Nutch MLs at ASF. Do you think we could remove this forwarding and get rid of this spam? Sami and Andrzej seem to be members who might be able to make this change: http://sourceforge.net/project/memberlist.php?group_id=59548 Actually, only Doug and Mike Cafarella are admins of that project. Doug, could you please disable this forwarding?
Re: Plans on releasing another bug fix release?
Will the next release really be 1.0 or will it be 0.10? Doug Briggs wrote: I was just curious to know if there were any plans to release a maintenance/bug-fix release before 1.0. I know there have been a slew of patches and such (it's almost impossible to keep up, unless someone has a suggestion on how to keep track of these, I may be missing something), and was wondering when/if these would be applied to the trunk and labeled as say 0.9.1. Briggs.
Re: JIRA email question
The problem is that nutch-dev (like most Apache mailing lists) sets the Reply-to header to be itself, so that responses don't go back to the sender. If you override this when responding (changing the To: line) and respond to the sender, then it should end up as a comment, which will then be copied to nutch-dev. But there's unfortunately no way to automatically override this. Thus it's best to click on the link in the message and respond directly in Jira. This is also more reliable. Sending messages to Jira doesn't always seem to work correctly. It might be good to remove that sentence suggesting that folks reply to the email, but I don't know if that's possible. Doug Doğacan Güney wrote: Hi list, There is this sentence at the end of every JIRA message: You can reply to this email to add a comment to the issue online. But, replying to a JIRA message through nutch-dev doesn't add it as a comment. So you have to either reply to an email through JIRA (in which case, it looks like you are responding to an imaginary person:) or through email (in which case, part of the discussion doesn't get documented in JIRA). Why doesn't this work?
[jira] Commented: (NUTCH-479) Support for OR queries
[ https://issues.apache.org/jira/browse/NUTCH-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507473 ] Doug Cutting commented on NUTCH-479: Neither. It would end up as the Lucene query: +"search phrase" +category:cat1 category:cat2 where category:cat2 is a non-required clause that just impacts ranking, not the set of documents returned. As for nested queries, parsing is only half the problem. The query filter plugins would need to be extended to handle such things, as they presently expect flat queries. The query "foo bar" currently expands to a Lucene query that looks something like: +(anchor:foo title:foo content:foo) +(anchor:bar title:bar content:bar) anchor:"foo bar"~10 title:"foo bar"~1000 content:"foo bar"~1000 (The latter three boost scores when terms are nearer. Anchor proximity is limited, to keep from matching anchors from other documents.) So, how should (foo AND (bar OR baz)) expand? Probably something like: +(anchor:foo title:foo content:foo) +((anchor:bar title:bar content:bar) (anchor:baz title:baz content:baz)) ... proximity boosting clauses?... And (foo OR (bar AND baz)) might expand to: (anchor:foo title:foo content:foo) (+(anchor:bar title:bar content:bar) +(anchor:baz title:baz content:baz)) ... proximity boosting clauses?... This expansion is done by the query-basic plugin. Support for OR queries -- Key: NUTCH-479 URL: https://issues.apache.org/jira/browse/NUTCH-479 Project: Nutch Issue Type: Improvement Components: searcher Affects Versions: 1.0.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 1.0.0 Attachments: or.patch There have been many requests from users to extend Nutch query syntax to add support for OR queries, in addition to the implicit AND and NOT queries supported now. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
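For illustration only, here is a rough sketch of how such a nested expansion might be assembled with the Lucene 2.x query API. The field names, the boolean structure, and the omission of the proximity-boosting clauses are simplifications; this is not what query-basic actually emits.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class OrExpansionSketch {

  // One optional clause per field for a single query term.
  static Query fields(String term) {
    BooleanQuery q = new BooleanQuery();
    q.add(new TermQuery(new Term("anchor", term)), BooleanClause.Occur.SHOULD);
    q.add(new TermQuery(new Term("title", term)), BooleanClause.Occur.SHOULD);
    q.add(new TermQuery(new Term("content", term)), BooleanClause.Occur.SHOULD);
    return q;
  }

  public static void main(String[] args) {
    // foo AND (bar OR baz): both top-level clauses are required,
    // but within the second clause either term suffices.
    BooleanQuery expanded = new BooleanQuery();
    expanded.add(fields("foo"), BooleanClause.Occur.MUST);

    BooleanQuery barOrBaz = new BooleanQuery();
    barOrBaz.add(fields("bar"), BooleanClause.Occur.SHOULD);
    barOrBaz.add(fields("baz"), BooleanClause.Occur.SHOULD);
    expanded.add(barOrBaz, BooleanClause.Occur.MUST);

    System.out.println(expanded);  // prints the expanded Lucene query
  }
}

The query filter plugins would still need a way to be handed the nested structure, which is the harder half of the problem noted above.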
[Fwd: Nutch 0.9 and Crawl-Delay]
Does the 0.9 crawl-delay implementation actually permit multiple threads to access a site simultaneously? Doug Original Message Subject: Nutch 0.9 and Crawl-Delay Date: Sun, 3 Jun 2007 10:50:24 +0200 From: Lutz Zetzsche [EMAIL PROTECTED] Reply-To: [EMAIL PROTECTED] To: [EMAIL PROTECTED] Dear Nutch developers, I have had problems with a Nutch based robot during the last 12 hours, which I have now solved by banning this particular bot from my server (not Nutch completely for the moment). The ilial bot, which created considerable load on my server, was using the latest Nutch version - v0.9 - which is now also supporting the crawl-delay directive in the robots.txt. The bot seems to have obeyed the directive - crawl-delay: 10 - as it visited my website every 15 seconds, which would have been ok, BUT it then submitted FIVE requests at once (see example log extract below)! 5 requests at once every 15 seconds is not acceptable on my server, which is principally serving dynamic content and is often visited by up to 10 search engines at the same time, altogether surely creating 99.9% of the server traffic. So my suggestion is that Nutch only submits one request at a time when it detects a crawl-delay directive in the robots.txt. This is the behaviour the MSNbot shows, for example. The MSNbot also liked to submit several requests at once every few seconds, until I added the crawl-delay directive to my robots.txt. Best wishes Lutz Zetzsche http://www.sea-rescue.de/
72.44.58.191 - - [03/Jun/2007:04:40:53 +0200] GET /english/Photos+%26+Videos/PV/ HTTP/1.0 200 13661 - ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles based Internet startup company. For more information please visit http://www.ilial.com/crawler; http://www.ilial.com/crawler; [EMAIL PROTECTED])
72.44.58.191 - - [03/Jun/2007:04:40:53 +0200] GET /english/Links/WRGL/Countries/ HTTP/1.0 200 15048 - ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles based Internet startup company. For more information please visit http://www.ilial.com/crawler; http://www.ilial.com/crawler; [EMAIL PROTECTED])
72.44.58.191 - - [03/Jun/2007:04:40:53 +0200] GET /islenska/Hlekkir/Brede-ger%C3%B0%20%2F%2033%20fet/ HTTP/1.0 200 60041 - ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles based Internet startup company. For more information please visit http://www.ilial.com/crawler; http://www.ilial.com/crawler; [EMAIL PROTECTED])
66.249.72.244 - - [03/Jun/2007:04:40:55 +0200] GET /francais/Liens/Philip+Vaux/Brede%20%2F%2033%20pieds/ HTTP/1.1 200 17568 - Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
66.231.189.119 - - [03/Jun/2007:04:40:55 +0200] GET /english/Links/Martijn%20Koenraad%20Hof/Netherlands%20Antilles/Sint%20Maarten/ HTTP/1.0 200 17193 - Gigabot/2.0 (http://www.gigablast.com/spider.html)
74.6.86.105 - - [03/Jun/2007:04:40:56 +0200] GET /dansk/Links/Hermann+Apelt/ HTTP/1.0 200 30496 - Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
72.44.58.191 - - [03/Jun/2007:04:40:53 +0200] GET /italiano/Links/Giamaica/MRCCs+%26+Stazioni+radio+costiera/ HTTP/1.0 200 16658 - ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles based Internet startup company. For more information please visit http://www.ilial.com/crawler; http://www.ilial.com/crawler; [EMAIL PROTECTED])
72.44.58.191 - - [03/Jun/2007:04:40:53 +0200] GET /english/Links/Mauritius/Countries/Organisations/ HTTP/1.0 200 15624 - ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles based Internet startup company. For more information please visit http://www.ilial.com/crawler; http://www.ilial.com/crawler; [EMAIL PROTECTED])
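For reference, the directive Lutz describes is an ordinary robots.txt entry along these lines (Crawl-delay is a non-standard extension, with the value in seconds; the ten-second value is just the one from his report):

User-agent: *
Crawl-delay: 10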
[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable
[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500822 ] Doug Cutting commented on NUTCH-392: Anchors, explain, and the cache are used relatively infrequently, considerably less than once per query, and hence *much* less than once per displayed hit. So it might be acceptable if they're somewhat slower. Block compression should still be fast enough for interactive use, and these uses would never dominate CPU use in an application, would they? OutputFormat implementations should pass on Progressable Key: NUTCH-392 URL: https://issues.apache.org/jira/browse/NUTCH-392 Project: Nutch Issue Type: New Feature Components: fetcher Reporter: Doug Cutting Assignee: Andrzej Bialecki Fix For: 1.0.0 Attachments: NUTCH-392.patch OutputFormat implementations should pass the Progressable they are passed to underlying SequenceFile implementations. This will keep reduce tasks from timing out when block writes are slow. This issue depends on http://issues.apache.org/jira/browse/HADOOP-636. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: proposal for committer
Personnel discussions are conducted on the PMC's private mailing list. I have forwarded your message there. Thanks for the suggestion! Doug Gal Nitzan wrote: Hi, Since I'm no committer I can't really propose :-) but I just thought to draw some attention to the great work done on the dev/users lists and also the many patches created by Doğacan Güney... Just my 2 cents... Gal.
Re: NUTCH-348 and Nutch-0.7.2
karthik085 wrote: How do you find when a revision was released? Look at the tags in subversion: http://svn.apache.org/viewvc/lucene/nutch/tags/ Doug
Re: ApacheCon in Amsterdam
Tom White wrote: I will be there too. Unfortunately I won't be able to attend after all. The new baby in the house won't let me! Doug
Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?
Arun Kaundal wrote: Actually nutch people are kind of autocratic; don't expect more from them. They do what they have decided. Have you submitted patches that have been ignored or rejected? Each Nutch contributor indeed does what he or she decides. Nutch is not a service organization that implements every feature that someone requests. It is a collaborative project of volunteers. Each contributor adds things they need, and others share the benefits. I am waiting for a really stable product with incremental indexing, which detects and adds/removes pages as soon as they are added/removed. But they don't want to do this; I don't know why. Perhaps because this is difficult, especially while still supporting large crawls. But if others don't want to implement this, I encourage you to try to implement it, and, if you succeed, contribute it back to the project. That's the way Nutch grows. What is their mission? If we join together to implement this, it would be better. I can work on this as a weekend project. Ping me if you want. You can of course fork Nutch, or start a new project from scratch. But you ought to also consider submitting patches to Nutch, working with other contributors to solve your problems here before abandoning Nutch in favor of another project. Cheers, Doug
Re: Image Search Engine Input
Steve Severance wrote: I am not looking to really make an image retrieval engine. During indexing referencing docs will be analyzed and text content will be associated with the image. Currently I want to keep this in a separate index. So despite the fact that images will be returned the search will be against text data. So do you just want to be able to reference the cached images? In that case, I think the images should stay in the content directory and be accessed like cached pages. The parse should just contain enough metadata to index so that the images can be located in the cache. I don't see a reason to keep this in a separate index, but perhaps a separate field instead? Then when displaying hits you can look up associated images and display them too. Does that work? Steve Severance wrote: I like Mathijs's suggestion about using a DB for holding thumbnails. I just want access to be in constant time since I am going to probably need to grab at least 10 and maybe 50 for each query. That can be kept in the plugin as an option or something like that. Does that have any ramifications for being run on Hadoop? I'm not sure how a database solves scalability issues. It seems to me that thumbnails should be handled similarly to summaries. They should be retrieved in parallel from segment data in a separate pass once the final set of hits to be displayed has been determined. Thumbnails could be placed in a directory per segment as a separate mapreduce pass. I don't see this as a parser issue, although perhaps it could be piggybacked on that mapreduce pass, which also processes content. Doug
Re: svn commit: r516643 - in /lucene/nutch/trunk/src/plugin/parse-html/src: java/org/apache/nutch/parse/html/DOMContentUtils.java test/org/apache/nutch/parse/html/TestDOMContentUtils.java
[EMAIL PROTECTED] wrote: [ ... ]
-/**
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
[ ... ]
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
This kind of thing is very unfortunate, since it makes it very difficult to figure out when particular lines were changed. I recommend always previewing commits with something like 'svn diff | less' before committing so that you can be sure to *only* commit changes that you intend. If your development environment does not permit you to preview the commit then please run subversion from the shell. Doug
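A minimal shell routine for such a preview might look like this (the issue id and commit message are placeholders):

svn status
svn diff | less            # review exactly what is about to be committed
svn commit -m "NUTCH-XXX. One-line description of the change."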
[jira] Commented: (NUTCH-455) dedup on tokenized fields is faulty
[ https://issues.apache.org/jira/browse/NUTCH-455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478854 ] Doug Cutting commented on NUTCH-455: Alternately, we could define it as an error to attempt to dedup by a tokenized field. That's the (undocumented) expectation of FieldCache. Using documents to populate a FieldCache for tokenized fields is very slow. It's better to add an untokenized version and use that, no? If you agree, then the more appropriate fix is to document the restriction and try to check for it at runtime. dedup on tokenized fields is faulty --- Key: NUTCH-455 URL: https://issues.apache.org/jira/browse/NUTCH-455 Project: Nutch Issue Type: Bug Components: searcher Affects Versions: 0.9.0 Reporter: Enis Soztutar Fix For: 0.9.0 Attachments: IndexSearcherCacheWarm.patch (From LUCENE-252) nutch uses several index servers, and the search results from these servers are merged using a dedup field for deleting duplicates. The values from this field are cached by Lucene's FieldCacheImpl. The default is the site field, which is indexed and tokenized. However for a tokenized field (for example url in nutch), FieldCacheImpl returns an array of terms rather than an array of field values, so dedup'ing becomes faulty. The current FieldCache implementation does not respect tokenized fields, and as described above caches only terms. So in the situation that we are searching using url as the dedup field, when a Hit is constructed in IndexSearcher, the dedupValue becomes a token of the url (such as www or com) rather than the whole url. This prevents using tokenized fields as the dedup field. I have written a patch for lucene and attached it in http://issues.apache.org/jira/browse/LUCENE-252; this patch fixes the aforementioned issue about tokenized field caching. However, building such a cache for about 1.5M documents takes 20+ secs. The code in IndexSearcher.translateHits() starts with if (dedupField != null) dedupValues = FieldCache.DEFAULT.getStrings(reader, dedupField); and for the first call of search in IndexSearcher, the cache is built. Long story short, I have written a patch against IndexSearcher, which in its constructor warms up the caches of the wanted fields (configurable). I think we should vote for LUCENE-252, and then commit the above patch with the latest version of lucene. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
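As a sketch of the "add an untokenized version" suggestion, assuming the Lucene 2.x Field API (the exact_url field name is made up):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class DedupFieldSketch {

  // Index the value twice: tokenized for full-text search, untokenized so
  // that FieldCache sees exactly one term per document for dedup.
  static Document withDedupField(Document doc, String url) {
    doc.add(new Field("url", url, Field.Store.YES, Field.Index.TOKENIZED));
    doc.add(new Field("exact_url", url, Field.Store.NO, Field.Index.UN_TOKENIZED));
    return doc;
  }

  public static void main(String[] args) {
    Document doc = withDedupField(new Document(), "http://lucene.apache.org/nutch/");
    System.out.println(doc);
  }
}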
Re: Issues pending before 0.9 release
Sami Siren wrote: It would be more beneficial to everybody if the discussions (related to release or Nutch) are done in public (hey, this is open source!). The off-the-list stuff IMO smells. +1 Folks sometimes wish to discuss project matters off-list to spare others the boring details, but this is usually a bad idea. All project decisions should be made in public on this list. Discussions relevant to these decisions are thus also best held on this list, since they explain the decision. Private discussions are permissible to develop a proposal, but that is usually better done on-list when possible, so that others can get involved earlier. (The one notable exception is that personnel issues are discussed on the private PMC list.) Doug
Re: FW: Nutch release process help
Chris Mattmann wrote: It's too bad that this has turned out to be an issue that I've handled incorrectly, and for that, I apologize. Sorry if I blew this out of proportion. We all help each other run this project. I don't think any grave error was made. I just saw an opportunity to remind folks to try to keep project discussions public, and did not mean to rebuke you. I am thrilled that you want to take on the responsibility of making a release. I very much do not want to damp your enthusiasm for that. As you probably know, the release documentation is at: http://wiki.apache.org/nutch/Release_HOWTO This may need to be updated. You might also look at the release documentation for other projects, to get ideas. http://wiki.apache.org/lucene-hadoop/HowToRelease http://wiki.apache.org/solr/HowToRelease http://wiki.apache.org/jakarta-lucene/ReleaseTodo Cheers, Doug
Re: Nutch JSF front-end code submission - Please advice next steps?
Zaheed Haque wrote: It's been about a month that I've been trying to find time to make the necessary changes so that I could submit the code. Due to an enormous workload I am unable to find the time. I am not sure how I should proceed; I have personally tried to contact some of you off list (those I thought might be interested, as they discuss more webapp-related issues on the list), but it seems like everyone is busy. So I am making a last effort here. I would love for someone to do something with the code rather than let it become obsolete. For a start, please attach it to an issue in Jira, as-is, so that it is not lost. Doug
[jira] Commented: (NUTCH-445) Domain İndexing / Query Filter
[ https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12476243 ] Doug Cutting commented on NUTCH-445: Note that the site field is also used for search-time deduplication, and that assumes that each document has only one value for the field (returned from a Lucene FieldCache with raw hits, for performance). So this feature should perhaps use a separate field. That said, I think this should replace the current site-search feature, as it is an improvement and the industry-standard semantics. So perhaps a site: query should search the domain: field? Domain İndexing / Query Filter -- Key: NUTCH-445 URL: https://issues.apache.org/jira/browse/NUTCH-445 Project: Nutch Issue Type: New Feature Components: indexer, searcher Affects Versions: 0.9.0 Reporter: Enis Soztutar Attachments: index_query_domain_v1.0.patch, index_query_domain_v1.1.patch, TranslatingRawFieldQueryFilter_v1.0.patch Hostnames contain information about the domain of the host, and all of the subdomains. Indexing and searching the domains is important for intuitive behavior. From the DomainIndexingFilter javadoc: Adds the domain (hostname) and all super domains to the index. For http://lucene.apache.org/nutch/ the following will be added to the index: lucene.apache.org, apache.org, org. All hostnames are domain names, but not all domain names are hostnames. In the above example the hostname lucene.apache.org is a subdomain of apache.org, which is itself a subdomain of org. Currently the basic indexing filter indexes the hostname in the site field, and the query-site plugin allows searching in the site field. However site:apache.org will not return http://lucene.apache.org. By indexing the domain, we are able to search domains. Unlike a search on the site field (indexed by BasicIndexingFilter), searching the domain field allows us to retrieve lucene.apache.org for the query apache.org. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Performance optimization for Nutch index / query
Andrzej Bialecki wrote: The degree of simplification is very substantial. Our NutchSuperQuery doesn't have to do much more work than a simple TermQuery, so we can assume that the cost to run it is the same as TermQuery times some constant. What we gain then is the cost of not running all those boolean clauses ... The NutchSuperQuery would have to do more work to boost things, and since postings would be longer they would also compress more poorly, so while there'd probably be some improvement, it wouldn't be quite as fast as a single-term query. If you're still with me at this point I must congratulate you. :) However, that's as far as I thought it through for now - let the discussion start! If you are a Lucene hacker I would gladly welcome your review or even code contributions .. ;) An implementation to consider is payloads. If each posting has a weight attached, then the fieldBoost*fieldNorm could be stored there, and a simple gap-based method could be used to inhibit cross-field matches. Queries would look similar to your proposed approach. http://www.gossamer-threads.com/lists/lucene/java-dev/37409 One might optimize the payload implementation with run-length compression: if a run of postings have the same payload it could be represented once at the start of the run along with the run's length. That would keep postings small, reducing i/o. Doug
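To illustrate the run-length idea only (this does not touch Lucene's payload API), a toy encoder for per-posting weights might look like:

import java.util.ArrayList;
import java.util.List;

public class WeightRunLength {

  // Encodes consecutive equal weights as (weight, runLength) pairs.
  static List<float[]> encode(float[] weights) {
    List<float[]> runs = new ArrayList<float[]>();
    int i = 0;
    while (i < weights.length) {
      int j = i;
      while (j < weights.length && weights[j] == weights[i]) {
        j++;
      }
      runs.add(new float[] { weights[i], j - i });
      i = j;
    }
    return runs;
  }

  public static void main(String[] args) {
    // Long runs of identical weights collapse to a single entry.
    float[] weights = { 1.0f, 1.0f, 1.0f, 2.5f, 2.5f, 1.0f };
    for (float[] run : encode(weights)) {
      System.out.println("weight=" + run[0] + " run=" + (int) run[1]);
    }
  }
}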
[jira] Assigned: (NUTCH-449) Format of junit output should be configurable
[ https://issues.apache.org/jira/browse/NUTCH-449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting reassigned NUTCH-449: -- Assignee: Doug Cutting Format of junit output should be configurable - Key: NUTCH-449 URL: https://issues.apache.org/jira/browse/NUTCH-449 Project: Nutch Issue Type: Improvement Affects Versions: 0.8.1 Reporter: Nigel Daley Assigned To: Doug Cutting Priority: Minor Attachments: hudson.patch Allow the junit output format to be set by a system property. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
nightly builds moved to hudson
Nutch's nightly builds have been moved to a Hudson server at: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ I've stopped the old nightly build process and added a redirect from the old nightly build distribution directory to this page. Thanks to Nigel Daley for configuring and maintaining the Hudson server! Doug
[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472821 ] Doug Cutting commented on NUTCH-443: this patch in some places removes the log guards Most of the log guards are misguided. Log guards should only be used on DEBUG level messages in performance-critical inner loops. Since INFO is the expected log level, a guard on INFO or WARN level messages does not improve performance, since these will be shown. And most DEBUG-level messages are not in performance critical code and hence do not need guards. The guards only make the code bigger and thus harder to read and maintain. allow parsers to return multiple Parse object, this will speed up the rss parser Key: NUTCH-443 URL: https://issues.apache.org/jira/browse/NUTCH-443 Project: Nutch Issue Type: New Feature Components: fetcher Affects Versions: 0.9.0 Reporter: Renaud Richardet Assigned To: Chris A. Mattmann Priority: Minor Fix For: 0.9.0 Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, NUTCH-443-draft-v6.patch, parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff allow Parser#parse to return a Map<String, Parse>. This way, the RSS parser can return multiple parse objects, that will all be indexed separately. Advantage: no need to fetch all feed-items separately. see the discussion at http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
log guards
Doug Cutting (JIRA) wrote: this patch in some places removes the log guards Most of the log guards are misguided. Log guards should only be used on DEBUG level messages in performance-critical inner loops. Since INFO is the expected log level, a guard on INFO or WARN level messages does not improve performance, since these will be shown. And most DEBUG-level messages are not in performance critical code and hence do not need guards. The guards only make the code bigger and thus harder to read and maintain. In particular, in all places where we check isWarnEnabled(), isFatalEnabled() and isInfoEnabled(), the 'if' should be removed. All calls to isDebugEnabled() should be reviewed, and most should be removed. These guards were all introduced by a patch some time ago. I complained at the time and it was promised that this would be repaired, but it has not yet been. Doug
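A small sketch of the convention described above, using commons-logging (the class and loop are hypothetical; only the guard placement matters):

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class GuardExample {
  private static final Log LOG = LogFactory.getLog(GuardExample.class);

  public void process(String[] urls) {
    // INFO/WARN: no guard needed -- these levels are normally enabled anyway.
    LOG.info("processing " + urls.length + " urls");

    for (String url : urls) {
      // DEBUG in a hot loop with string construction: a guard is worthwhile.
      if (LOG.isDebugEnabled()) {
        LOG.debug("processing url: " + url);
      }
      // ... per-url work ...
    }
  }
}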
Re: RSS-fecter and index individul-how can i realize this function
Renaud Richardet wrote: I see. I was thinking that I could index the feed items without having to fetch them individually. Okay, so if Parser#parse returned a Map<String, Parse>, then the URL for each parse should be that of its link, since you don't want to fetch that separately. Right? So now the question is, how much impact would this change to the Parser API have on the rest of Nutch? It would require changes to all Parser implementations, to ParseSegment, to ParseUtil, and to Fetcher. But, as far as I can tell, most of these changes look straightforward. Doug
Re: RSS-fecter and index individul-how can i realize this function
Chris Mattmann wrote: Sorry to be so thick-headed, but could someone explain to me in really simple language what this change is requesting that is different from the current Nutch API? I still don't get it, sorry... A Content would no longer generate a single Parse. Instead, a Content could potentially generate many Parses. For most types of content, e.g., HTML, each Content would still generate a single Parse. But for RSS, a Content might generate multiple Parses, each indexed separately and each with a distinct URL. Another potential application could be processing archives: the parser could unpack the archive and index each item in it separately rather than indexing the archive as a whole. This only makes sense if each item has a distinct URL, which it does in RSS, but it might not in an archive. However some archive file formats do contain URLs, like that used by the Internet Archive. http://www.archive.org/web/researcher/ArcFileFormat.php Does that help? Doug
Re: RSS-fecter and index individul-how can i realize this function
Doğacan Güney wrote: OK, then should I go forward with this and implement something? This should be pretty easy, though I am not sure what to give as keys to a Parse[]. I mean, when getParse returned a single Parse, ParseSegment output them as <url, Parse>. But, if getParse returns an array, what will be the key for each element? Perhaps Parser#parse could return a Map<String, Parse>, where the keys are URLs? Something like <url#i, Parse[i]> may work, but this may cause problems in dedup (for example, assume we fetched the same rss feed twice, and indexed them in different indexes. Two versions' url#0 may be different items, but since they have the same key, dedup will delete the older). If the feed contains unique ids for items, then that can be used to qualify the URL. Otherwise one could use the hash of the link of the item. Since the target of the link must still be indexed separately from the item itself, how much use is all this? If the RSS document is considered a single page that changes frequently, and items' links are considered ordinary outlinks, isn't much the same effect achieved? Doug
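As an illustration of the key question only, a toy sketch of deriving stable keys for feed items (the method and example URLs are made up, and plain strings stand in for Parse objects):

import java.util.LinkedHashMap;
import java.util.Map;

public class FeedItemKeys {

  /** Key for a feed item: the item's guid if present, otherwise the feed URL
   *  qualified by a hash of the item's link, so repeated fetches of the same
   *  feed yield the same keys. */
  static String keyFor(String feedUrl, String guid, String link) {
    if (guid != null && guid.length() > 0) return guid;
    return feedUrl + "#" + Integer.toHexString(link.hashCode());
  }

  public static void main(String[] args) {
    Map<String, String> parses = new LinkedHashMap<String, String>();
    parses.put(keyFor("http://example.com/feed.xml", null, "http://example.com/a"), "item A text");
    parses.put(keyFor("http://example.com/feed.xml", "tag:example.com,2007:b", "http://example.com/b"), "item B text");
    System.out.println(parses.keySet());
  }
}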
Re: RSS-fecter and index individul-how can i realize this function
Renaud Richardet wrote: The use case is that you index RSS-feeds, but your users can search each feed-entry as a single document. Does it make sense? But each feed item also contains a link whose content will be indexed, and that's generally a superset of the item. So should there be two urls indexed per item? In many cases, the best thing to do is to index only the linked page, not the feed item at all. In some (rare?) cases, there might be items without a link, whose only content is directly in the feed, or where the content in the feed is complementary to that in the linked page. In these cases it might be useful to combine the two (the feed item and the linked content), indexing both. The proposed change might permit that. Is that the case you're concerned about? Doug
Re: RSS-fecter and index individul-how can i realize this function
Doğacan Güney wrote: I think it would make much more sense to change parse plugins to take content and return Parse[] instead of Parse. You're right. That does make more sense. Doug
Re: RSS-fecter and index individul-how can i realize this function
Gal Nitzan wrote: IMHO the data that is needed i.e. the data that will be fetched in the next fetch process is already available in the item element. Each item element represents one web resource. And there is no reason to go to the server and re-fetch that resource. Perhaps ProtocolOutput should change. The method: Content getContent(); could be deprecated and replaced with: Content[] getContents(); This would require changes to the indexing pipeline. I can't think of any severe complications, but I haven't looked closely. Could something like that work? Doug
Re: i18n in nutch home page is a misnomer
Teruhiko Kurosaka wrote: I suggest i18n be renamed to l10n, short for localization. Can you please file an issue in Jira for this? Ideally you could even provide a patch. The source for the website is in subversion at: http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/site Forrest is used to generate the site from this. http://forrest.apache.org/ Doug
Re: [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore
Scott Ganyo (JIRA) wrote: ... since Hadoop hijacks and reassigns all log formatters (also a bad practice!) in the org.apache.hadoop.util.LogFormatter static constructor ... FYI, Hadoop no longer does this. Doug
Re: Next Nutch release
Dennis Kubes wrote: Andrzej Bialecki wrote: I believe that at this point it's crucial to keep the project well-focused (at the moment I think the main focus is on larger installations, and not the small ones), and also to make Nutch attractive to developers as a reusable search engine component. I think there are two areas. One is to keep the focus as you stated above. The other is to provide a path to get more people involved. If no one objects I will continue working on such a path. Please let me know if I can help in this people area. I'm currently unable to assist with technical Nutch issues on a day-to-day basis, but I am still very interested in doing what I can to ensure Nutch's long-term vitality as a project. Cheers, Doug
Re: [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore
Chris Mattmann wrote: So, does this render the patch that I wrote obsolete? It's at least out-of-date and perhaps obsolete. A quick read of Fetcher.java looks like there might be a case where a fatal error is logged but the fetcher doesn't exit, in FetcherThread#output(). Doug
Re: Finished How to Become a Nutch Developer
[EMAIL PROTECTED] wrote: Draft version of How to Become a Nutch Developer is on the wiki at: http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer Please take a look and if you think anything needs to be added, removed, or changed let me know. Thanks for taking the time to write this up! It looks great. I hope to meet, as we say it in Texas, ya'll in person one day. Indeed! Ironically, I met a bunch of Lucene developers for the first time in Texas last fall, at ApacheCon. I hope to attend both ApacheCon EU this May in Amsterdam and ApacheCon US this November in Atlanta. Maybe I'll meet you there? Doug
Re: How to Become a Nutch Developer
Andrzej Bialecki wrote: The workflow is different - I'm not sure about the details, perhaps Doug can correct me if I'm wrong ... and yes, it uses JIRA extensively. 1. An issue is created 2. patches are added, removed, commented, etc... 3. finally, a candidate patch is selected, and the issue is marked Patch available. Patch Available is code for "the contributor now believes this is ready to commit." Once a patch is in this state, a committer reviews it and either commits it or rejects it, changing the state of the issue back to Open. The set of issues in Patch Available thus forms a work queue for committers. We try not to let a patch sit in this state for more than a few days. 4. An automated process applies the patch to a temporary copy, and checks whether it compiles and passes junit tests. This is currently hosted by Yahoo!, run by Nigel Daley, but it wouldn't be hard to run this for Nutch on lucene.zones.apache.org, and I think Nigel would probably gladly share his scripts. This step saves committers time: if a patch doesn't pass unit tests, or has javadoc warnings, etc., this can be identified automatically. 5. A list of patches in this state is available, and committers may pick from this list and apply them. 6. An explicit link is made between the issue and the change set committed to svn (Is this automated?) Jira does this based on commit messages. Any bug ids mentioned in a commit message create links from that bug to the revision in subversion. Hadoop commit messages usually start with the bug id, e.g., "HADOOP-1234. Remove a deadlock in the oscillation overthruster." 7. The issue is marked as Resolved, but not closed. I believe issues are closed only when a release is made, because issues in state resolved make up the Changelog. I believe this is also automated. Jira will put resolved issues into the release notes regardless of whether they're closed. The reason we close issues on release is to keep folks from re-opening them. We want the release notes to be the list of changes in a release, so we don't want folks re-opening issues and having new commits made against them, since then the changes related to the issue will span multiple releases. If an issue is closed but there's still a problem, a new issue should be created linking to the prior issue, so that the new issue can be scheduled and tracked without modifying what should be a read-only release. Doug
Re: Reviving Nutch 0.7
[EMAIL PROTECTED] wrote: Yes, certainly, anything that can be shared and decoupled from pieces that make each branch (not SVN/CVS branch) different, should be decoupled. But I was really curious about whether people think this is a valid idea/direction, not necessarily immediately how things should be implemented. In my mind, one branch is the branch that runs on top of Hadoop, with NameNode, DataNode, HDFS, etc. That's the branch that's in the trunk. The other branch is a simpler branch without all that Hadoop stuff, for folks who need to fetch, index, and search a few hundred thousand or a few million or even a few tens of millions of pages, and don't need replication, etc. that comes with Hadoop. That branch could be based off of 0.7. I also know that a lot of people are trying to use Nutch to build vertical search engines, so there is also a need for a focused fetcher. Kelvin Tan brought this up a few times, too, I believe. Branching doesn't sound like the right solution here. First, one doesn't need to run any Hadoop daemons to use Nutch: everything should run fine in a single process by default. If there are bugs in this they should be logged, folks who care should submit high-quality, back-compatible, generally useful patches, and committers should work to get these patches committed to the trunk. Second, if there are to be two modes of operation, wouldn't they best be developed in a common source tree, so that they share as much as possible and diverge as little as possible? It seems to me that a good architecture would be to agree on a common high-level API, then use two different runtimes underneath, one to support distributed operation, and one to support standalone operation. Hey! That's what Hadoop already does! Maybe it's not perfect and someone can propose a better way to share maximal amounts of code, but the code split should probably be into different classes and packages in a single source tree maintained by a single community of developers, not by branching a single source tree in revision control and splitting the developers. Third, part of the problem seems to be that there are too few contributors--that the challenges are big and the resources limited. Splitting the project will only spread those resources more thinly. What really is the issue here? Are good patches languishing? Are there patches that should be committed (meet coding standards, are back-compatible, generally useful, etc.) but are not? A great patch is one that a committer can commit with few worries: it includes new unit tests, it passes all existing unit tests, it fixes one thing only, etc. Such patches should not have to wait long for commit. And once someone submits a few such patches, then they should be invited to become a committer. It sounds to me like the problem is that, off-the-shelf, Nutch does not yet solve all the problems folks would like it to: e.g., it has never done a good job with incremental indexing. Folks see progress made on scalability, but really wish it were making more progress on incrementality or something else. But it's not going to make progress on incrementality without someone doing the work. A fork or a branch isn't going to do the work. I don't see any reason that the work cannot be done right now. It can be done incrementally: e.g., if the web db API seems inappropriate for incremental updates, then someone should submit a patch that provides an incremental web db API, updating the fetcher and indexer to use this.
A design for this on the wiki would be a good place to start. Finally, web crawling, indexing and searching are data-intensive. Before long, users will want to index tens or hundreds of millions of pages. Distributed operation is soon required at this scale, and batch mode is an order of magnitude faster. So be careful before you throw those features out: you might want them back soon. Doug
Re: How to Become a Nutch Developer
Dennis Kubes wrote: Can you answer the question of how to add developer names to JIRA or if that is only for committers? It's not just for committers, but also for regular contributors. I have added you. Anyone else? Doug
Re: Next Nutch release
Stefan Groschupf wrote: I don't want to start a emotional discussion here, however talking about the problem in public might help. What, specifically, is the problem you perceive? Doug
Re: Next Nutch release
Dennis Kubes wrote: I will say that it is difficult for people to understand how to get more involved. I have been working with Nutch and Hadoop for almost a year now on a daily basis and only now am I understanding how to contribute through jira, etc. There needs to be more guidance in helping developers contribute. For example, if you want to develop a new piece of functionality, then you do x, y, and z. Here is how to patch your system. If you want to develop a patch then here are the steps. The closest thing we have currently are the HowToContribute pages: http://wiki.apache.org/nutch/HowToContribute http://wiki.apache.org/lucene-hadoop/HowToContribute http://wiki.apache.org/jakarta-lucene/HowToContribute These are not great, but they're a start. Are there parts that are confusing? Do they assume too much? Are they missing things? If so, please help to update these. I note that the Nutch version is less evolved than the Lucene and Hadoop versions. Doug
Re: Next Nutch release
Stefan Groschupf wrote: We run the gui in several production environments with patched hadoop code - since this is from our point of view the clean approach. Everything else feels like a workaround to fix some strange hadoop behaviors. Are there issues in Hadoop's Jira for these? If so, do they have patches attached? Are they linked to the corresponding issue in Nutch? Doug
Re: How can I get one plugin's root dir
Andrzej Bialecki wrote: The reason is that if you pack this file into your job JAR, the job jar would become very large (presumably this 40MB is already compressed?). The job jar needs to be copied to each tasktracker for each task, so you will experience a performance hit just because of the size of the job jar ... whereas if this file sits on DFS and is highly replicated, its content will always be available locally. Note that the job jar is copied into HDFS with a highish replication (10?), and that it is only copied to each tasktracker node once per *job*, not per task. So it's only faster to manage this yourself if you have a sequence of jobs that share this data, and if the time to re-replicate it per job is significant. Doug
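For the sits-on-DFS approach, the usual shell pattern is roughly the following (file names and the replication factor are illustrative):

bin/hadoop dfs -put dictionary.bin /shared/dictionary.bin
bin/hadoop dfs -setrep 10 /shared/dictionary.bin   # raise replication so most reads are node-local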
Re: Brochure for Nutch
The wiki would be a good place for this. Doug Peter Landolt wrote: Hello, We tried to introduce Nutch at a telecommunication company in Switzerland as the search engine of their future main search solution. As they were also evaluating commercial products, we needed to offer them a brochure to make sure they understand Nutch as a product with its features, references, etc. This document was mainly created to establish the credibility of Nutch. Attached please find the document. Now we would like you to review and improve the document to publish it on the web. Thanks for your feedback and best regards, Peter
Re: What's the status of Nutch-GUI?
Sami Siren wrote: Stefan Groschupf wrote: See: http://www.find23.net/nutch_guiToHadoop.pdf Section required hadoop changes. I guess you refer to these:
• LocalJobRunner:
• Run as a kind of singleton
• Have a kind of jobQueue
• Implement JobSubmissionProtocol status-report methods
• implement killJob method
Is there an issue in Hadoop's Jira for this? Is there a patch that implements these? If there is, then I suggest folks vote for the issue.
- how about writing a nutchrunner that just extends the functionality of localjobrunner?
- scheduling (jobQueue) could be completely outside of jobrunner?
These also sound like good solutions. If it is not Nutch-specific, then perhaps it could be integrated into Hadoop, so that it is maintained as Hadoop evolves. If that sounds like a good approach, please submit a patch to Hadoop with some unit tests. Cheers, Doug
[jira] Assigned: (NUTCH-392) OutputFormat implementations should pass on Progressable
[ http://issues.apache.org/jira/browse/NUTCH-392?page=all ] Doug Cutting reassigned NUTCH-392: -- Assignee: Doug Cutting OutputFormat implementations should pass on Progressable Key: NUTCH-392 URL: http://issues.apache.org/jira/browse/NUTCH-392 Project: Nutch Issue Type: New Feature Components: fetcher Reporter: Doug Cutting Assigned To: Doug Cutting OutputFormat implementations should pass the Progressable they are passed to underlying SequenceFile implementations. This will keep reduce tasks from timing out when block writes are slow. This issue depends on http://issues.apache.org/jira/browse/HADOOP-636. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (NUTCH-392) OutputFormat implementations should pass on Progressable
[ http://issues.apache.org/jira/browse/NUTCH-392?page=all ] Doug Cutting updated NUTCH-392: --- Attachment: NUTCH-392.patch OutputFormat implementations should pass on Progressable Key: NUTCH-392 URL: http://issues.apache.org/jira/browse/NUTCH-392 Project: Nutch Issue Type: New Feature Components: fetcher Reporter: Doug Cutting Assigned To: Doug Cutting Attachments: NUTCH-392.patch OutputFormat implementations should pass the Progressable they are passed to underlying SequenceFile implementations. This will keep reduce tasks from timing out when block writes are slow. This issue depends on http://issues.apache.org/jira/browse/HADOOP-636. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable
[ http://issues.apache.org/jira/browse/NUTCH-392?page=comments#action_12444719 ] Doug Cutting commented on NUTCH-392: This should not be applied until Nutch uses Hadoop 0.8. It also contains a patch required to make Nutch work correctly with Hadoop 0.8 (where LocalFileSystem.rename() of a non-existing file now throws an exception). OutputFormat implementations should pass on Progressable Key: NUTCH-392 URL: http://issues.apache.org/jira/browse/NUTCH-392 Project: Nutch Issue Type: New Feature Components: fetcher Reporter: Doug Cutting Assigned To: Doug Cutting Attachments: NUTCH-392.patch OutputFormat implementations should pass the Progressable they are passed to underlying SequenceFile implementations. This will keep reduce tasks from timing out when block writes are slow. This issue depends on http://issues.apache.org/jira/browse/HADOOP-636. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (NUTCH-392) OutputFormat implementations should pass on Progressable
[ http://issues.apache.org/jira/browse/NUTCH-392?page=all ] Doug Cutting updated NUTCH-392: --- Attachment: (was: NUTCH-392.patch) OutputFormat implementations should pass on Progressable Key: NUTCH-392 URL: http://issues.apache.org/jira/browse/NUTCH-392 Project: Nutch Issue Type: New Feature Components: fetcher Reporter: Doug Cutting Assigned To: Doug Cutting Attachments: NUTCH-392.patch OutputFormat implementations should pass the Progressable they are passed to underlying SequenceFile implementations. This will keep reduce tasks from timing out when block writes are slow. This issue depends on http://issues.apache.org/jira/browse/HADOOP-636. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (NUTCH-392) OutputFormat implementations should pass on Progressable
[ http://issues.apache.org/jira/browse/NUTCH-392?page=all ] Doug Cutting updated NUTCH-392: --- Attachment: NUTCH-392.patch Oops. Attached the wrong patch. Here's the right one. OutputFormat implementations should pass on Progressable Key: NUTCH-392 URL: http://issues.apache.org/jira/browse/NUTCH-392 Project: Nutch Issue Type: New Feature Components: fetcher Reporter: Doug Cutting Assigned To: Doug Cutting Attachments: NUTCH-392.patch OutputFormat implementations should pass the Progressable they are passed to underlying SequenceFile implementations. This will keep reduce tasks from timing out when block writes are slow. This issue depends on http://issues.apache.org/jira/browse/HADOOP-636. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: email to jira comments (WAS Re: [jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements)
Sami Siren wrote: looks like somebody just enabled the email-to-jira-comments feature. I was just wondering whether it would be good to use this feature more widely. I think it would be good. That way mailing list discussion would be logged to the bug as well. This could be achieved by removing the Reply-To header from messages coming from jira so that replies get sent to [EMAIL PROTECTED] (I am assuming that is possible). So whenever somebody just hits reply from their email client and writes a comment, it would get automatically attached to the correct issue as a comment. I sent a message to [EMAIL PROTECTED] this morning asking about this. If it's possible, and no one objects, I will request it for the Nutch mailing lists. Doug
[jira] Resolved: (NUTCH-304) Change JIRA email address for nutch issues from apache incubator
[ http://issues.apache.org/jira/browse/NUTCH-304?page=all ] Doug Cutting resolved NUTCH-304. Resolution: Fixed I just fixed this. Thanks for noticing! Change JIRA email address for nutch issues from apache incubator Key: NUTCH-304 URL: http://issues.apache.org/jira/browse/NUTCH-304 Project: Nutch Issue Type: Task Environment: Dell Pentium M mobile 1.4 Ghz, 512 MB RAM, although task is independent of environment Reporter: Chris A. Mattmann Priority: Minor The default email address for Nutch issues in JIRA should be changed from nutch-dev@incubator.apache.org to [EMAIL PROTECTED] Could one of the committers with appropriate jira privileges update the email? -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time
[ http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12439682 ] Doug Cutting commented on NUTCH-353: It's worth noting that Google, Yahoo! and Microsoft's searches all return lots of links to www-XXX.ibm.com. Just some evidence that this may not be an easy problem to solve. pages that serverside forwards will be refetched every time --- Key: NUTCH-353 URL: http://issues.apache.org/jira/browse/NUTCH-353 Project: Nutch Issue Type: Bug Affects Versions: 0.8.1, 0.9.0 Reporter: Stefan Groschupf Assigned To: Andrzej Bialecki Priority: Blocker Fix For: 0.9.0 Attachments: doNotRefecthForwarderPagesV1.patch Pages that do a serverside forward are not written with a status change back into the crawlDb. Also the nextFetchTime is not changed. This causes a refetch of the same page again and again. The result is nutch is not polite and refetching the forwarding and target page in each segment iteration. Also it effects the scoring since the forward page contribute it's score to all outlinks. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: Patch Available status?
Chris Mattmann wrote: +1. I think that workflow makes a lot of sense. Currently users in the nutch-developers group can close and resolve issues. In the Hadoop workflow, would this continue to be the case? In Hadoop, most developers can resolve but not close. Only members of a separate Jira group (hadoop-admin, a subset of hadoop-developers) are permitted to close bugs. Note that the Jira group hadoop-developers has far more members than Hadoop has committers. But the nutch-developers Jira group pretty closely corresponds to Nutch's committers, so perhaps all committers should be permitted to close, although this should be exercised with caution, only at releases, since closes cannot be undone in this workflow. Another alternative would be to construct a new workflow that just adds the Patch Available status and still permits issues to be re-opened. Which sounds best for Nutch? Doug
Re: Patch Available status?
Sami Siren wrote: I am not able to do it either, or then I just don't know how; can Doug help us here? This requires a change to the project's workflow. I'd be happy to move Nutch to use the workflow we use for Hadoop, which supports Patch Available. This workflow has one other non-default feature, which is that bugs, once closed, cannot be re-opened. This works as follows: Only project administrators are allowed to close issues. Bugs are resolved as they're fixed, and only closed when a release is made. This keeps the release notes Jira generates from changing after a release is made. Would you like me to switch Nutch to use this Jira workflow? Doug
Re: Error with Hadoop-0.4.0
Sami Siren wrote: Patch works for me. OK. I just committed it. Thanks! Doug
Re: Error with Hadoop-0.4.0
Jérôme Charron wrote: In my environment, the crawl command terminates with the following error:
2006-07-06 17:41:49,735 ERROR mapred.JobClient (JobClient.java:submitJob(273)) - Input directory /localpath/crawl/crawldb/current in local is invalid.
Exception in thread "main" java.io.IOException: Input directory /localpathcrawl/crawldb/current in local is invalid.
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
at org.apache.nutch.crawl.Injector.inject(Injector.java:146)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
Hadoop 0.4.0 by default requires all input directories to exist, where previous releases did not. So we need to either create an empty current directory or change the InputFormat used in CrawlDb.createJob() to be one that overrides InputFormat.areValidInputDirectories(). The former is probably easier. I've attached a patch. Does this fix things for folks? Doug
Index: src/java/org/apache/nutch/crawl/CrawlDb.java
===================================================================
--- src/java/org/apache/nutch/crawl/CrawlDb.java (revision 417882)
+++ src/java/org/apache/nutch/crawl/CrawlDb.java (working copy)
@@ -65,7 +65,8 @@
     if (LOG.isInfoEnabled()) { LOG.info("CrawlDb update: done"); }
   }
 
-  public static JobConf createJob(Configuration config, Path crawlDb) {
+  public static JobConf createJob(Configuration config, Path crawlDb)
+    throws IOException {
     Path newCrawlDb = new Path(crawlDb, Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
@@ -73,7 +74,11 @@
     JobConf job = new NutchJob(config);
     job.setJobName("crawldb " + crawlDb);
-    job.addInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME));
+
+    Path current = new Path(crawlDb, CrawlDatum.DB_DIR_NAME);
+    if (FileSystem.get(job).exists(current)) {
+      job.addInputPath(current);
+    }
     job.setInputFormat(SequenceFileInputFormat.class);
     job.setInputKeyClass(UTF8.class);
     job.setInputValueClass(CrawlDatum.class);
[jira] Reopened: (NUTCH-309) Uses commons logging Code Guards
[ http://issues.apache.org/jira/browse/NUTCH-309?page=all ] Doug Cutting reopened NUTCH-309: I am re-opening this issue, as the guards were added in far too many places. Jerome, can you please fix these so that guards are only added when (a) the log level is DEBUG or TRACE, (b) it occurs in performance-critical code, and (c) the logged string is not constant. Uses commons logging Code Guards Key: NUTCH-309 URL: http://issues.apache.org/jira/browse/NUTCH-309 Project: Nutch Type: Improvement Versions: 0.8-dev Reporter: Jerome Charron Assignee: Jerome Charron Priority: Minor Fix For: 0.8-dev Code guards are typically used to guard code that only needs to execute in support of logging, that otherwise introduces undesirable runtime overhead in the general case (logging disabled). Examples are multiple parameters, or expressions (e.g. string + more) for parameters. Use the guard methods of the form log.isPriority() to verify that logging should be performed, before incurring the overhead of the logging method call. Yes, the logging methods will perform the same check, but only after resolving parameters. (description extracted from http://jakarta.apache.org/commons/logging/guide.html#Code_Guards) -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Resolved: (NUTCH-312) Fix for upcoming incompatibility with Hadoop-0.4
[ http://issues.apache.org/jira/browse/NUTCH-312?page=all ] Doug Cutting resolved NUTCH-312: Fix Version: 0.8-dev Resolution: Fixed I just upgraded Nutch to Hadoop 0.4.0, incorporating this patch. Thanks, Milind! Fix for upcoming incompatibility with Hadoop-0.4 Key: NUTCH-312 URL: http://issues.apache.org/jira/browse/NUTCH-312 Project: Nutch Type: Improvement Environment: all Reporter: Milind Bhandarkar Fix For: 0.8-dev Attachments: nutch-latest.patch, nutch.patch I have submitted a patch to Hadoop fixing tasktracker-latency issues. That patch introduces incompatibility with current nutch code, because the interface for OutputFormat will change. I will soon submit a patch for nutch that will fix this upcoming incompatibility with Hadoop. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: svn commit: r416346 [1/3] - in /lucene/nutch/trunk/src: java/org/apache/nutch/analysis/ java/org/apache/nutch/clustering/ java/org/apache/nutch/crawl/ java/org/apache/nutch/fetcher/ java/org/apach
[EMAIL PROTECTED] wrote: NUTCH-309 : Added logging code guards [ ... ]
+      if (LOG.isWarnEnabled()) {
+        LOG.warn("Line does not contain a field name: " + line);
+      }
[ ... ]
-1 I don't think guards should be added everywhere. They make the code bigger and provide little benefit. Rather, guards should only be added in performance-critical code, and then only for Debug-level output. Info and Warn levels are normally enabled, and developers should thus not log messages at these levels so frequently that performance will be compromised. And not all Debug-level log statements need guards, only those that are in inner loops, where the construction of the log message may significantly affect performance. Doug
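For readers following the guard debate, here is a small illustrative sketch (not code from the Nutch tree) of where a guard pays off under the criteria above and where it only adds noise:

    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;

    public class GuardExample {
      private static final Log LOG = LogFactory.getLog(GuardExample.class);

      public void process(String[] lines) {
        for (int i = 0; i < lines.length; i++) {
          // Guarded: DEBUG level, inner loop, and the message is built by
          // concatenation, so the guard avoids that work when DEBUG is off.
          if (LOG.isDebugEnabled()) {
            LOG.debug("processing line: " + lines[i]);
          }
        }
        // Unguarded: WARN is normally enabled and this runs once per call,
        // so a guard here would only make the code bigger.
        LOG.warn("processed " + lines.length + " lines");
      }
    }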
IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?
http://incredibill.blogspot.com/2006/06/how-much-nutch-is-too-much-nutch.html
Re: Nutch logging questions
Jérôme Charron wrote: For now, I have used the same log4j properties as Hadoop (see http://svn.apache.org/viewvc/lucene/hadoop/trunk/conf/log4j.properties?view=markup&pathrev=411254 ) for the back-end, and I was thinking of using stdout for the front-end. What do you think about this? We should use console rather than stdout, so that it can be distinguished from application output. http://issues.apache.org/jira/browse/HADOOP-292 Doug
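For illustration only, a console appender along those lines might look like this in log4j.properties; the appender name, target and pattern here are examples rather than the settings Hadoop or Nutch finally adopted:

    log4j.rootLogger=INFO,console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} - %m%n

Sending the appender to System.err keeps logging separate from whatever the application itself writes on stdout.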
Re: svn commit: r411943 - in /lucene/nutch/trunk/lib: commons-logging-1.0.4.jar hadoop-0.2.1.jar hadoop-0.3.1.jar log4j-1.2.13.jar
Stefan Groschupf wrote: As far as I understand, Hadoop uses commons logging. Should we switch to using commons logging as well? +1 Doug
[jira] Commented: (NUTCH-289) CrawlDatum should store IP address
[ http://issues.apache.org/jira/browse/NUTCH-289?page=comments#action_12414114 ] Doug Cutting commented on NUTCH-289: It should be possible to partition by IP and limit fetchlists by IP. Resolving only in the fetcher is too late to implement these features. Ideally we should arrange things for good DNS cache utilization, so that urls with the same host are resolved in a single map or reduce task. Currently this is the case during fetchlist generation, where lists are partitioned by host. Might that be a good place to insert DNS resolution? The fetchlists would need to be processed one more time, to re-partition and re-limit by IP, but fetchlists are relatively small, so this might not slow things too much. The map task itself could directly cache IP addresses, and perhaps even avoid many DNS lookups by using the IP from another CrawlDatum from the same host. A multi-threaded mapper might also be used to allow for network latencies. This should, at least initially, be an optional feature, and thus the IP should probably initially be stored in the metadata. I think it might be added as a re-generate step without changing any other code. CrawlDatum should store IP address -- Key: NUTCH-289 URL: http://issues.apache.org/jira/browse/NUTCH-289 Project: Nutch Type: Bug Components: fetcher Versions: 0.8-dev Reporter: Doug Cutting If the CrawlDatum stored the IP address of the host of its URL, then one could: - partition fetch lists on the basis of IP address, for better politeness; - truncate pages to fetch per IP address, rather than just hostname. This would be a good way to limit the impact of domain spammers. The IP addresses could be resolved when a CrawlDatum is first created for a new outlink, or perhaps during CrawlDB update. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
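A rough sketch of the per-task DNS caching idea described above -- plain Java, not tied to the actual Generator or CrawlDatum APIs, and the class name is made up:

    import java.net.InetAddress;
    import java.net.UnknownHostException;
    import java.util.HashMap;
    import java.util.Map;

    /** Per-task cache: resolve each host once and reuse the address for every
     *  URL of that host seen by the same map or reduce task. */
    public class HostAddressCache {
      private final Map cache = new HashMap();   // host -> dotted-quad string

      public String resolve(String host) {
        String ip = (String) cache.get(host);
        if (ip == null) {
          try {
            ip = InetAddress.getByName(host).getHostAddress();
          } catch (UnknownHostException e) {
            ip = "";                 // remember failures too, to avoid retrying
          }
          cache.put(host, ip);
        }
        return ip;
      }
    }

Because generation already groups all URLs of a host into one reduce task, a cache like this would see each host essentially once per task.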
Re: Mailing List nutch-agent Reports of Bots Submitting Forms
Ken Krugler wrote: 2. Are the Nutch Devs replying to the emails sent to this list? I could understand if they are replying off-list, but to an outside observer such as myself it appears as though webmasters are not getting many replies to their inquiries. I can speak for myself only... I'm not tracking that list. What about others? Folks who are running a Nutch-based crawler that provides this email address as the contact address should subscribe to this list and respond to messages, especially those which may have been caused by their crawler. Others are also encouraged to subscribe and help respond to messages here, as a bad reputation for the crawler affects the whole project. This list is actually fairly low-volume. This brings up an issue I've been thinking about. It might make sense to require everybody to set the user-agent string, versus it having default values that point to Nutch. The first time you run Nutch, it would display an error re the user-agent string not being set, but if the instructions for how to do this were explicit, this wouldn't be much of a hardship for anybody trying it out. +1 That would be a better solution. Doug
[jira] Commented: (NUTCH-273) When a page is redirected, the original url is NOT updated.
[ http://issues.apache.org/jira/browse/NUTCH-273?page=comments#action_12413528 ] Doug Cutting commented on NUTCH-273: Redirects should really not be followed immediately anyway. We should instead note that it was redirected and to which URL in the fetcher output. Then, when the crawl db is updated with the fetcher output, the target of the redirect should be added, with the full OPIC score of the original URL. This will enable proper politeness guarantees. It would be nice to still associate the original URL with the content of the redirect URL when indexing. Perhaps a list of URLs that redirected to each page could be kept in the CrawlDatum metadata? Can anyone think of a better way to implement this? When a page is redirected, the original url is NOT updated. --- Key: NUTCH-273 URL: http://issues.apache.org/jira/browse/NUTCH-273 Project: Nutch Type: Bug Components: fetcher Versions: 0.8-dev Environment: n/a Reporter: Lukas Vlcek [Excerpt from maillist, sender: Andrzej Bialecki] When a page is redirected, the original url is NOT updated - so, CrawlDB will never know that a redirect occurred, it won't even know that a fetch occurred... This looks like a bug. In 0.7 this was recorded in the segment, and then it would affect the Page status during updatedb. It should do so in 0.8, too... -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Created: (NUTCH-289) CrawlDatum should store IP address
CrawlDatum should store IP address -- Key: NUTCH-289 URL: http://issues.apache.org/jira/browse/NUTCH-289 Project: Nutch Type: Bug Components: fetcher Versions: 0.8-dev Reporter: Doug Cutting If the CrawlDatum stored the IP address of the host of its URL, then one could: - partition fetch lists on the basis of IP address, for better politeness; - truncate pages to fetch per IP address, rather than just hostname. This would be a good way to limit the impact of domain spammers. The IP addresses could be resolved when a CrawlDatum is first created for a new outlink, or perhaps during CrawlDB update. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-288) hitsPerSite-functionality flawed: problems writing a page-navigation
[ http://issues.apache.org/jira/browse/NUTCH-288?page=comments#action_12413272 ] Doug Cutting commented on NUTCH-288: Is there a performant way of doing deduplication and knowing for sure how many documents are available to view? No. But we should probably handle this situation without throwing exceptions. For example, look at the following: http://www.google.com/search?q=emacs+%22doug+cutting%22start=90 Click on the page 19 link at the bottom. It takes you to page 16, the last page after deduplication. hitsPerSite-functionality flawed: problems writing a page-navigation -- Key: NUTCH-288 URL: http://issues.apache.org/jira/browse/NUTCH-288 Project: Nutch Type: Bug Components: web gui Versions: 0.8-dev Reporter: Stefan Neufeind The deduplication-functionality on a per-site-basis (hitsPerSite = 3) leads to problems when trying to offer a page-navigation (e.g. allow the user to jump to page 10). This is because dedup is done after fetching. RSS shows a maximum number of 7763 documents (that is without dedup!), I set it to display 10 items per page. My naive approach was to estimate I have 7763/10 = 777 pages. But already when moving to page 3 I got no more searchresults (I guess because of dedup). And when moving to page 10 I got an exception (see below). 2006-05-25 16:24:43 StandardWrapperValve[OpenSearch]: Servlet.service() for servlet OpenSearch threw exception java.lang.NegativeArraySizeException at org.apache.nutch.searcher.Hits.getHits(Hits.java:65) at org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:149) at javax.servlet.http.HttpServlet.service(HttpServlet.java:689) at javax.servlet.http.HttpServlet.service(HttpServlet.java:802) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214) at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104) at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520) at org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:152) at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104) at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137) at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:118) at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:102) at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104) at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520) at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:929) at org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:160) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:799) at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:705) at 
org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:577) at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684) at java.lang.Thread.run(Thread.java:595) Only workaround I see for the moment: Fetching RSS without duplication, dedup myself and cache the RSS-result to improve performance. But a cleaner solution would imho be nice. Is there a performant way of doing deduplication and knowing for sure how many documents are available to view? For sure this would mean to dedup all search-results first ... -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-288) hitsPerSite-functionality flawed: problems writing a page-navigation
[ http://issues.apache.org/jira/browse/NUTCH-288?page=comments#action_12413305 ] Doug Cutting commented on NUTCH-288: Is there a quickfix possible somehow? Someone needs to fix the OpenSearch servlet. It looks like just changing line 146 of OpenSearchServlet.java, replacing: Hit[] show = hits.getHits(start, end-start); with: Hit[] show = hits.getHits(start, length > 0 ? length : 0); Give this a try. hitsPerSite-functionality flawed: problems writing a page-navigation -- Key: NUTCH-288 URL: http://issues.apache.org/jira/browse/NUTCH-288 Project: Nutch Type: Bug Components: web gui Versions: 0.8-dev Reporter: Stefan Neufeind The deduplication-functionality on a per-site-basis (hitsPerSite = 3) leads to problems when trying to offer a page-navigation (e.g. allow the user to jump to page 10). This is because dedup is done after fetching. RSS shows a maximum number of 7763 documents (that is without dedup!), I set it to display 10 items per page. My naive approach was to estimate I have 7763/10 = 777 pages. But already when moving to page 3 I got no more search results (I guess because of dedup). And when moving to page 10 I got an exception (see below). 2006-05-25 16:24:43 StandardWrapperValve[OpenSearch]: Servlet.service() for servlet OpenSearch threw exception java.lang.NegativeArraySizeException at org.apache.nutch.searcher.Hits.getHits(Hits.java:65) at org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:149) at javax.servlet.http.HttpServlet.service(HttpServlet.java:689) at javax.servlet.http.HttpServlet.service(HttpServlet.java:802) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214) at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104) at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520) at org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:152) at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104) at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137) at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:118) at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:102) at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104) at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520) at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:929) at org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:160) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:799) at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:705) at org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:577) at 
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684) at java.lang.Thread.run(Thread.java:595) Only workaround I see for the moment: Fetching RSS without duplication, dedup myself and cache the RSS-result to improve performance. But a cleaner solution would imho be nice. Is there a performant way of doing deduplication and knowing for sure how many documents are available to view? For sure this would mean to dedup all search-results first ... -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-267) Indexer doesn't consider linkdb when calculating boost value
[ http://issues.apache.org/jira/browse/NUTCH-267?page=comments#action_12379116 ] Doug Cutting commented on NUTCH-267: re: it's as if we didn't want it to be re-crawled if we can't find any inlinks to it We prioritize crawling based on the number of pages we've crawled that link to it since we've last crawled it. Assuming it had links to it that caused it to be crawled the first time, and that some of those will also be re-crawled, then its score will again increase. But if no one links to it anymore, it will languish, and not be crawled again unless there're no higher-scoring pages. That sounds right to me, and I think it's what's suggested in the OPIC paper (if i skimmed it correctly). Perhaps it should not be reset to zero, but one, since that's where pages start out. re: why use sqrt(opic) * docSimilarity instead of log(opic * docSimilarity) Wrapping log() around things changes the score value but not the ranking. So the question is really, why use sqrt(opic)*docSimilarity and not just opic*docSimilarity? The answer is simply that I tried a few queries and sqrt seemed to be required for OPIC to not overly dominate scoring. It was a seat of the pants calculation, trying to balance the strength of anchor matches, opic scoring and title, url and body matching, etc. One can disable this by changing the score power parameter. Indexer doesn't consider linkdb when calculating boost value Key: NUTCH-267 URL: http://issues.apache.org/jira/browse/NUTCH-267 Project: Nutch Type: Bug Components: indexer Versions: 0.8-dev Reporter: Chris Schneider Priority: Minor Before OPIC was implemented (Nutch 0.7, very early Nutch 0.8-dev), if indexer.boost.by.link.count was true, the indexer boost value was scaled based on the log of the # of inbound links: if (boostByLinkCount) res *= (float)Math.log(Math.E + linkCount); This is no longer true (even before Andrzej implemented scoring filters). Instead, the boost value is just the square root (or some other scorePower) of the page score. Shouldn't the invertlinks command, which creates the linkdb, have some affect on the boost value calculated during indexing (either via the OPICScoringFilter or some other built-in filter)? -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/
Jérôme Charron wrote: This means there's no markup in the OpenSearch output? Yes, no markup for now. Doesn't this break any existing application that uses OpenSearch and displays summaries in a web browser? This is an incompatible change which we should avoid. Shouldn't there be? The restriction on the description field is: "Can contain simple escaped HTML markup, such as <b>, <i>, <a>, and <img> elements." So, ya, why not. We can add <b> around highlights. What do you and others think? +1 Perhaps this should be a method on Summary, to render it as html? I had some hesitations about this while coding. In fact, as suggested in the issue's comments, I would like to add a generic method on Summary: String toString(Encoder, Formatter) like in Lucene's Highlighter, and provide some basic implementations of Encoder and Formatter. That sounds fine, but in the meantime, let's not reproduce the html-specific code in lots of places. We need it in both search.jsp and in OpenSearchServlet.java. So we should have it in a common place. A method on Summary seems like a good place. If we subsequently add a more general API then we could re-implement the toHtml() method using that API, but I think a generic toHtml() method will be useful for quite a while yet. Doug
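A minimal sketch of what such a Summary.toHtml() could do -- bold the highlighted fragments and entity-encode everything else. The Fragment fields below are stand-ins, not the actual Nutch API:

    /** Stand-in for Summary's fragment type; the real class looks different. */
    class Fragment {
      String text;
      boolean highlight;
    }

    public class SummaryHtmlSketch {
      /** Render fragments as HTML: encode text, wrap highlights in <b>. */
      static String toHtml(Fragment[] fragments) {
        StringBuffer html = new StringBuffer();
        for (int i = 0; i < fragments.length; i++) {
          String text = encode(fragments[i].text);
          if (fragments[i].highlight) {
            html.append("<b>").append(text).append("</b>");
          } else {
            html.append(text);
          }
        }
        return html.toString();
      }

      private static String encode(String s) {  // minimal HTML entity encoding
        StringBuffer out = new StringBuffer();
        for (int i = 0; i < s.length(); i++) {
          char c = s.charAt(i);
          if (c == '<') out.append("&lt;");
          else if (c == '>') out.append("&gt;");
          else if (c == '&') out.append("&amp;");
          else out.append(c);
        }
        return out.toString();
      }
    }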
Re: dfs -report
This is a known, fixed, Hadoop bug: http://issues.apache.org/jira/browse/HADOOP-201 I'm going to release Hadoop 0.2.1 with this and one other patch as soon as Subversion is back up, then upgrade Nutch to use 0.2.1. Doug Marko Bauhardt wrote: Hi all, I started nutch-0.8-dev (Revision 405738) on the distributed filesystem. If I execute bin/hadoop dfs -report an exception occurs. java.lang.RuntimeException: java.lang.IllegalAccessException: Class org.apache.hadoop.io.WritableFactories can not access a member of class org.apache.hadoop.dfs.DatanodeInfo with modifiers public at org.apache.hadoop.io.WritableFactories.newInstance (WritableFactories.java:49) at org.apache.hadoop.io.ObjectWritable.readObject (ObjectWritable.java:226) at org.apache.hadoop.io.ObjectWritable.readObject (ObjectWritable.java:163) at org.apache.hadoop.io.ObjectWritable.readObject (ObjectWritable.java:211) at org.apache.hadoop.io.ObjectWritable.readFields (ObjectWritable.java:60) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:170) Caused by: java.lang.IllegalAccessException: Class org.apache.hadoop.io.WritableFactories can not access a member of class org.apache.hadoop.dfs.DatanodeInfo with modifiers public at sun.reflect.Reflection.ensureMemberAccess(Reflection.java: 65) at java.lang.Class.newInstance0(Class.java:344) at java.lang.Class.newInstance(Class.java:303) at org.apache.hadoop.io.WritableFactories.newInstance (WritableFactories.java:45) ... 5 more What am I doing wrong? Marko
Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/
Jérôme Charron wrote: Yes Doug, but in fact, the idea is to add the toString(Formatter) method in a common place (Summary), and add one specific Formatter implementation for OpenSearch and another one for search.jsp: The reason is that they should not use the same HTML code: 1. OpenSearch should only use <b> around highlights 2. search.jsp should use some more complicated HTML code (<span ...>) In fact, I don't know if the Formatter solution is the right one, but the toString() or toHtml() must be parametrized since the two pieces of code that use this method should have distinct outputs. This all sounds fine, I'm just remarking that, at present, the OpenSearch output has changed incompatibly, which is a bad thing, and that I wish, until this is fully worked out, OpenSearch returned what it did before (markup, although perhaps exceeding what's advised). Doug
[jira] Commented: (NUTCH-267) Indexer doesn't consider linkdb when calculating boost value
[ http://issues.apache.org/jira/browse/NUTCH-267?page=comments#action_12378765 ] Doug Cutting commented on NUTCH-267: Andrzej: your analysis is correct, but it mostly only applies when re-crawling. In an initial crawl, where each url is fetched only once, I think we implement the OPIC Greedy strategy. The question of what to do when re-crawling has not been adequately answered, but, glancing at the paper, it seems that resetting a url's score to zero each time it is fetched might be the best thing to do, so that it can start accumulating more cash. When ranking, summing logs is the same as multiplying, no? Indexer doesn't consider linkdb when calculating boost value Key: NUTCH-267 URL: http://issues.apache.org/jira/browse/NUTCH-267 Project: Nutch Type: Bug Components: indexer Versions: 0.8-dev Reporter: Chris Schneider Priority: Minor Before OPIC was implemented (Nutch 0.7, very early Nutch 0.8-dev), if indexer.boost.by.link.count was true, the indexer boost value was scaled based on the log of the # of inbound links: if (boostByLinkCount) res *= (float)Math.log(Math.E + linkCount); This is no longer true (even before Andrzej implemented scoring filters). Instead, the boost value is just the square root (or some other scorePower) of the page score. Shouldn't the invertlinks command, which creates the linkdb, have some affect on the boost value calculated during indexing (either via the OPICScoringFilter or some other built-in filter)? -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets
[ http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12378458 ] Doug Cutting commented on NUTCH-134: +1 for Summary as Writable and change HitSummarizer.getSummary() to return a Summary directly rather than a String. I don't think this has bad performance implications. Summarizer doesn't select the best snippets --- Key: NUTCH-134 URL: http://issues.apache.org/jira/browse/NUTCH-134 Project: Nutch Type: Bug Components: searcher Versions: 0.7.2, 0.7.1, 0.7, 0.8-dev Reporter: Andrzej Bialecki Attachments: summarizer.060506.patch Summarizer.java tries to select the best fragments from the input text, where the frequency of query terms is the highest. However, the logic in line 223 is flawed in that the excerptSet.add() operation will add new excerpts only if they are not already present - the test is performed using the Comparator that compares only the numUniqueTokens. This means that if there are two or more excerpts, which score equally high, only the first of them will be retained, and the rest of equally-scoring excerpts will be discarded, in favor of other excerpts (possibly lower scoring). To fix this the Set should be replaced with a List + a sort operation. To keep the relative position of excerpts in the original order the Excerpt class should be extended with an int order field, and the collected excerpts should be sorted in that order prior to adding them to the summary. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
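The fix described in the issue boils down to collecting every candidate excerpt and sorting, so that equally-scoring excerpts are kept rather than silently dropped by a set's comparator. A self-contained sketch (field and class names are illustrative, not the Summarizer's actual ones):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;

    /** Stand-in for the Summarizer's Excerpt with just the fields needed here. */
    class Excerpt {
      int numUniqueTokens;   // how many distinct query terms the excerpt contains
      int order;             // proposed field: position in the original text
    }

    public class ExcerptSelectionSketch {
      /** Sort all candidates by score, keep the best maxExcerpts (ties included),
       *  then restore original document order before building the summary. */
      static List select(List excerpts, int maxExcerpts) {
        Collections.sort(excerpts, new Comparator() {
          public int compare(Object a, Object b) {
            return ((Excerpt) b).numUniqueTokens - ((Excerpt) a).numUniqueTokens;
          }
        });
        int n = Math.min(maxExcerpts, excerpts.size());
        List best = new ArrayList(excerpts.subList(0, n));
        Collections.sort(best, new Comparator() {
          public int compare(Object a, Object b) {
            return ((Excerpt) a).order - ((Excerpt) b).order;
          }
        });
        return best;
      }
    }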
[jira] Commented: (NUTCH-267) Indexer doesn't consider linkdb when calculating boost value
[ http://issues.apache.org/jira/browse/NUTCH-267?page=comments#action_12378560 ] Doug Cutting commented on NUTCH-267: The OPIC score is much like a count of incoming links, but a bit more refined. OPIC(P) is one plus the sum of the OPIC contributions for all links to a page. The OPIC contribution of a link from page P is OPIC(P) / numOutLinks(P). Indexer doesn't consider linkdb when calculating boost value Key: NUTCH-267 URL: http://issues.apache.org/jira/browse/NUTCH-267 Project: Nutch Type: Bug Components: indexer Versions: 0.8-dev Reporter: Chris Schneider Priority: Minor Before OPIC was implemented (Nutch 0.7, very early Nutch 0.8-dev), if indexer.boost.by.link.count was true, the indexer boost value was scaled based on the log of the # of inbound links: if (boostByLinkCount) res *= (float)Math.log(Math.E + linkCount); This is no longer true (even before Andrzej implemented scoring filters). Instead, the boost value is just the square root (or some other scorePower) of the page score. Shouldn't the invertlinks command, which creates the linkdb, have some affect on the boost value calculated during indexing (either via the OPICScoringFilter or some other built-in filter)? -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
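A small worked example of that definition: a page P with OPIC(P) = 6 and 3 outlinks contributes 6 / 3 = 2 through each of its links; a page linked only from P then scores 1 + 2 = 3, and a page linked from two such pages scores 1 + 2 + 2 = 5. (The numbers are made up purely to illustrate the formula above.)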
Re: generate.max.per.host is per reduce task
Chris Schneider wrote: I just noticed that the generate.max.per.host property is only enforced on a per-reduce-task basis during the first generate job (see Generator.Selector.reduce for details). At a minimum, it should probably be documented this way in nutch-default.xml.template. Yes, but all URLs with the same host are in a single reduce task, since it is generating host-disjoint fetch lists. Doug
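Not the actual Generator code, but the idea behind host-disjoint fetch lists can be sketched in a few lines: every URL of a given host hashes to the same partition, hence the same reduce task and the same fetch list, which is why a per-host limit enforced inside one reduce task still covers that host globally.

    import java.net.MalformedURLException;
    import java.net.URL;

    public class HostPartitionSketch {
      /** All URLs of one host map to the same partition. */
      public static int partitionForUrl(String url, int numPartitions) {
        String host;
        try {
          host = new URL(url).getHost().toLowerCase();
        } catch (MalformedURLException e) {
          host = url;                  // degenerate case: hash the whole string
        }
        return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
    }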
Re: svn commit: r399515 - /lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java
This sort of error will become much harder to make once we upgrade to Hadoop 0.2 and replace most uses of java.io.File with org.apache.hadoop.fs.Path. Doug [EMAIL PROTECTED] wrote:

Author: ab
Date: Wed May 3 19:42:02 2006
New Revision: 399515
URL: http://svn.apache.org/viewcvs?rev=399515&view=rev
Log: Use the FileSystem instead of java.io.File.exists().
Modified: lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java

Modified: lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java
URL: http://svn.apache.org/viewcvs/lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java?rev=399515&r1=399514&r2=399515&view=diff
==============================================================================
--- lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java (original)
+++ lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java Wed May 3 19:42:02 2006
@@ -502,7 +502,7 @@
       }
     }
     Configuration conf = NutchConfiguration.create();
-    FileSystem fs = FileSystem.get(conf);
+    final FileSystem fs = FileSystem.get(conf);
     SegmentReader segmentReader = new SegmentReader(conf, co, fe, ge, pa, pd, pt);
     // collect required args
     switch (mode) {
@@ -529,7 +529,9 @@
       File dir = new File(args[++i]);
       File[] files = fs.listFiles(dir, new FileFilter() {
         public boolean accept(File pathname) {
-          if (pathname.isDirectory()) return true;
+          try {
+            if (fs.isDirectory(pathname)) return true;
+          } catch (IOException e) {};
           return false;
        }
      });
CommerceNet Events » Blog Archive » T 3 5/11: Stefan Groschupf on Extending Nutch
It seems Stefan is giving a talk... http://events.commerce.net/?p=58 Doug
Re: mapred question
[EMAIL PROTECTED] wrote: As far as we understood from the MapRed documentation, all reduce tasks must be launched after the last map task is finished, i.e. map and reduce must not work simultaneously. But often in logs we see such records: map 80%, reduce 10%, and many more records where map is less than 100% but reduce is more than 0%. How should we interpret this? Hadoop includes the shuffle stage in reduce. Currently, the first 25% of a reduce task's progress is copying map outputs to the reduce node. These copies can start as soon as any map task completes, so that, when the last map task completes, there is very little data remaining to be copied, and the rest of the reduce work can quickly start. Doug
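To make the quoted numbers concrete: with the copy phase weighted at 25%, a job whose reduces have so far copied the output from, say, 40% of the maps would report reduce progress of 0.25 * 40% = 10% while the maps themselves might be at 80% -- exactly the kind of "map 80%, reduce 10%" line quoted above. (The 40% figure is just an illustration, not taken from the original logs.)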
Re: Content-Type inconsistency?
Jérôme Charron wrote: We had to turn off the guessing of content types to index Apache correctly. Instead of turning off the guessing of content types you only need to remove the magic for xml in mime-types.xml Perhaps that would have worked also, but, with Apache, simply trusting the declared Content-Type seems to work quite well. I think we shouldn't aim to guess things any more than a browser does. If browsers require standards compliance, then our lives will be simpler. Yes, but actually Nutch cannot act as a browser. For instance with RSS: A browser knows that a URL is an RSS feed because there is a <link rel="alternate" type="..."/> with the correct content-type (application/rss+xml) in the referring HTML page. Nutch doesn't keep such information for guessing a content-type (it could be a good thing to add), so it must find the content-type from the URL (without any context). Shouldn't RSS feeds declare the correct content-type? http://feedvalidator.org/docs/warning/NonSpecificMediaType.html I don't see that context should be required for feeds. Doug
[jira] Commented: (NUTCH-257) Summary#toString always Entity encodes -- problem for OpenSearchServlet#description field
[ http://issues.apache.org/jira/browse/NUTCH-257?page=comments#action_12376989 ] Doug Cutting commented on NUTCH-257: I'd vote to never have Summary#toString() perform entity encoding, to fix search.jsp to encode things itself, and *not* to add a new Summary#toEntityEncodedString() method. Summary#toString always Entity encodes -- problem for OpenSearchServlet#description field - Key: NUTCH-257 URL: http://issues.apache.org/jira/browse/NUTCH-257 Project: Nutch Type: Bug Components: searcher Versions: 0.8-dev Reporter: [EMAIL PROTECTED] Priority: Minor All search result data we display in search results has to be explicitly Entity.encoded outputing in search.jsp ( title, url, etc.) except Summaries. Its already Entity.encoded. This is fine when outputing HTML but it gets in the way when outputing otherwise -- as xml for example. I'd suggest we not make any presumption about how search results are used. The problem becomes especially acute when the text language is other than english. Here is an example of what a Czech description field in an OpenSearchServlet hit record looks like: descriptionlt;span class=ellipsisgt; ... lt;/spangt;Vamp;#283;deckamp;aacute; knihovna v Olomouci Bezruamp;#269;ova 2, Olomouc 9, 779 11, amp;#268;eskamp;aacute; republika amp;nbsp; tel. +420-585223441 amp;nbsp; fax +420-585225774 http://www.lt;span class=highlightgt;vkollt;/spangt;.cz/ amp;nbsp;amp;nbsp; mailto:info@lt;span class=highlightgt;vkollt;/spangt;.cz Otevamp;#345;eno : amp;nbsp; po-pamp;aacute; amp;nbsp; 8 30 -19 00 amp;nbsp;amp;nbsp;amp;nbsp; so amp;nbsp; 9 00 -13 00 amp;nbsp;amp;nbsp;amp;nbsp; ne amp;nbsp; zavamp;#345;eno V katalogu s amp;uacute;plnamp;yacute;m amp;#269;asovamp;yacute;mlt;span class=ellipsisgt; ... lt;/spangt;03 Organizace 20/12 Odkazy 19/04 Hledej 23/03 amp;nbsp; 23/03 amp;nbsp; Poamp;#269;et pamp;#345;amp;iacute;stupamp;#367; od 1.9.1998. Statistiky . [ ] amp;nbsp; [ Nahoru ] lt;span class=highlightgt;VKOLlt;/spangt;/description Here is same description field with Entity.encoding disabled: descriptionlt;span class=ellipsisgt; ... lt;/spangt;tisky statistiky knihovny WWW serveru st?edov?ké rukopisy studovny CD-ROM historických fond? hlavní Internet N?mecké knihovny vázaných novin SVKOL viz lt;span class=highlightgt;VKOLlt;/spangt; ?atna T telefonní ?ísla knihovny zam?stnanc? U V vazba v?cný popis vedení knihovny vedoucí odd?lení video lt;span class=highlightgt;VKOLlt;/spangt; volný výb?r výp?j?ka výro?ní zpráva výstavy W webmaster WWW odkazy X Y Z - ? zamluvení knihy zahrani?ní periodika zpracování fondult;span class=highlightgt;VKOLlt;/spangt; - hledej Hledej [ lt;span class=highlightgt;VKOLlt;/spangt; ] [ Novinky ] [ Katalog ] [ Slu?by ] [ Aktivity ] [ Pr?vodce ] [ Dokumenty ] [ Regionální fce ] [ Organizace ] [ Odkazy ] [ Hledej ] [ ] [ ] Obsah full-textové vyhledávání, 19/04/2003 rejst?ík vybranýchlt;span class=ellipsisgt; ... lt;/spangt;/description Notice how the Czech characters in the first are all numerically encoded: i.e. #NNN;. I'd suggest that Summary#toString() become Summary#toEntityEncodedString() and that toString return raw aggregation of Fragments. Would likely require adding methods to the HitSummarizer interface so can ask for either raw text or entity encoded with addition to NutchBean so can ask for either. Or, better I'd suggest is that Summarizer never return Entity.encoded text. Let that happen in search.jsp (I can make patch to do the latter if its amenable). -- This message is automatically generated by JIRA. 
- If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: exception
[EMAIL PROTECTED] wrote: We updated hadoop from trunk branch. But now we get new errors: Oops. Looks like I introduced a bug yesterday. Let me fix it... Sorry, Doug
Re: TRUNK IllegalArgumentException: Argument is not an array (WAS: Re: exception)
I just fixed this. Sorry for the inconvenience! Doug Michael Stack wrote: I'm getting same as Anton below trying to launch a new job with latest from TRUNK. Logic in ObjectWriteable#readObject seems a little off. On the way in we test for a null instance. If null, we set to NullWriteable. Next we test declaredClass to see if its an array. We then try to do an Array.getLength on instance -- which we've above set as NullWriteable. Looks like we should test instance to see if its NullWriteable before we do the Array.getLength (or do the instance null check later). Hope above helps, St.Ack [EMAIL PROTECTED] wrote: We updated hadoop from trunk branch. But now we get new errors: On tasktarcker side: skiped java.io.IOException: timed out waiting for response at org.apache.hadoop.ipc.Client.call(Client.java:305) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:149) at org.apache.hadoop.mapred.$Proxy0.pollForTaskWithClosedJob(Unknown Source) at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:310) at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:374) at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:813) 060427 062708 Client connection to 10.0.0.10:9001 caught: java.lang.RuntimeException: java.lang.ClassNotFoundException: java.lang.RuntimeException: java.lang.ClassNotFoundException: at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:152) at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:139) at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:186) at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:60) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:170) 060427 062708 Client connection to 10.0.0.10:9001: closing On jobtracker side: skiped 060427 061713 Server handler 3 on 9001 caught: java.lang.IllegalArgumentException: Ar gument is not an array java.lang.IllegalArgumentException: Argument is not an array at java.lang.reflect.Array.getLength(Native Method) at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:92) at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:64) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:250) skiped -Original Message- From: Doug Cutting [mailto:[EMAIL PROTECTED] Sent: Thursday, April 27, 2006 12:48 AM To: nutch-dev@lucene.apache.org Subject: Re: exception Importance: High This is a Hadoop DFS error. It could mean that you don't have any datanodes running, or that all your datanodes are full. Or, it could be a bug in dfs. You might try a recent nightly build of Hadoop to see if it works any better. Doug Anton Potehin wrote: What means error of following type : java.rmi.RemoteException: java.io.IOException: Cannot obtain additional block for file /user/root/crawl/indexes/index/_0.prx
Re: Content-Type inconsistency?
Jérôme Charron wrote: Finally, it is good news that Nutch seems to be more intelligent on content-type guessing than Firefox or IE, no? I'm not so sure. When crawling Apache we had trouble with this feature. Some HTML files that had an XML header, and that the server identified as text/html, Nutch decided to treat as XML, not HTML. We had to turn off the guessing of content types to index Apache correctly. I think we shouldn't aim to guess things any more than a browser does. If browsers require standards compliance, then our lives will be simpler. Doug
Re: exception
This is a Hadoop DFS error. It could mean that you don't have any datanodes running, or that all your datanodes are full. Or, it could be a bug in dfs. You might try a recent nightly build of Hadoop to see if it works any better. Doug Anton Potehin wrote: What means error of following type : java.rmi.RemoteException: java.io.IOException: Cannot obtain additional block for file /user/root/crawl/indexes/index/_0.prx
[jira] Resolved: (NUTCH-250) Generate to log truncation caused by generate.max.per.host
[ http://issues.apache.org/jira/browse/NUTCH-250?page=all ] Doug Cutting resolved NUTCH-250: Fix Version: 0.8-dev Resolution: Fixed Assign To: Doug Cutting I just committed this. Thanks, Rod. Generate to log truncation caused by generate.max.per.host -- Key: NUTCH-250 URL: http://issues.apache.org/jira/browse/NUTCH-250 Project: Nutch Type: Improvement Versions: 0.8-dev Reporter: Rod Taylor Assignee: Doug Cutting Fix For: 0.8-dev Attachments: nutch-generate-truncatelog.patch LOG.info() hosts which have had their generate lists truncated. This can inform admins about potential abusers or excessively large sites that they may wish to block with rules. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: mapred.map.tasks
Anton Potehin wrote: We have a question on this property. Is it really preferred to set this parameter several times greater than the number of available hosts? We do not understand why it should be so. It should be at least numHosts*mapred.tasktracker.tasks.maximum, so that all of the task slots are used. More tasks make recovery faster when a task fails, since less needs to be redone. Our spider is distributed among 3 machines. What value is most preferred for this parameter in our case? Which other factors may have an effect on the most preferred value of this parameter? When fetching, the total number of hosts you're fetching can also be a factor, since fetch tasks are hostwise-disjoint. If you're only fetching a few hosts, then a large value for mapred.map.tasks will cause there to be a few big fetch tasks and a bunch of empty ones. This could be a problem if the big ones are not allocated evenly among your nodes. I generally use 5*numHosts*mapred.tasktracker.tasks.maximum. Doug
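As a worked example of that rule of thumb: with the three machines above and, say, two task slots per tasktracker (mapred.tasktracker.tasks.maximum = 2 -- an assumption, check your own configuration), the minimum that keeps every slot busy is 3 * 2 = 6 map tasks, and the 5x rule gives 5 * 3 * 2 = 30.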
Re: jobtaraker and tasktracker
Anton Potehin wrote: Are there any ways to rotate these logs ? One way would be to configure the JVM to use a rolling FileHandler: file:///home/cutting/local/jdk1.5-docs/api/java/util/logging/FileHandler.html This should be possible by setting HADOOP_OPTS (in conf/hadoop-env.sh) and NUTCH_OPTS to include something like: -Djava.util.logging.config.file=myfile The default logging config file is in your JVM, at jre/lib/logging.properties. I have not in fact tried this. If you do, please tell how it works. Doug
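A minimal sketch of what such a logging configuration file might contain, using the standard java.util.logging property names (the file pattern, size and count are just examples):

    handlers=java.util.logging.FileHandler
    java.util.logging.FileHandler.pattern=logs/tasktracker%g.log
    java.util.logging.FileHandler.limit=10000000
    java.util.logging.FileHandler.count=10
    java.util.logging.FileHandler.formatter=java.util.logging.SimpleFormatter
    .level=INFO

With limit and count set, the JVM rotates across ten files of roughly 10 MB each instead of growing one log indefinitely.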
Re: question about crawldb
Anton Potehin wrote: 1. We have found these flags in CrawlDatum class: public static final byte STATUS_SIGNATURE = 0; public static final byte STATUS_DB_UNFETCHED = 1; public static final byte STATUS_DB_FETCHED = 2; public static final byte STATUS_DB_GONE = 3; public static final byte STATUS_LINKED = 4; public static final byte STATUS_FETCH_SUCCESS = 5; public static final byte STATUS_FETCH_RETRY = 6; public static final byte STATUS_FETCH_GONE = 7; Though the names of these flags describe their aims, it is not clear completely what they mean and what is the difference between STATUS_DB_FETCHED and STATUS_FETCH_SUCCESS for example. The STATUS_DB_* codes are used in entries in the crawldb. STATUS_FETCH_* codes are used in fetcher output. STATUS_LINKED is used in parser output for urls that are linked to. A crawldb update combines all of these (the old version of the db, plus fetcher and parser output) to generate a new version of the db, containing only STATUS_DB_* entries. This logic is in CrawlDbReducer. Does that help? Doug
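A greatly simplified sketch of that merge, using the constants listed above; the real CrawlDbReducer also handles retry counts, scores, signatures and link data:

    import org.apache.nutch.crawl.CrawlDatum;

    public class CrawlDbMergeSketch {
      /** Pick the new db status for a url from its old db status and the
       *  latest fetcher output; STATUS_LINKED entries for brand-new urls
       *  would enter the db as STATUS_DB_UNFETCHED. */
      static byte mergeStatus(byte oldDbStatus, byte fetchStatus) {
        switch (fetchStatus) {
          case CrawlDatum.STATUS_FETCH_SUCCESS:
            return CrawlDatum.STATUS_DB_FETCHED;   // fetched this round
          case CrawlDatum.STATUS_FETCH_GONE:
            return CrawlDatum.STATUS_DB_GONE;      // permanently failed
          case CrawlDatum.STATUS_FETCH_RETRY:
          default:
            return oldDbStatus;                    // keep the previous status
        }
      }
    }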
Re: Duplicate Detection: Offlince vs. Search Time
Shailesh Kochhar wrote: If I understand this correctly, you can only dedup by one field. This would mean that if you were to implement and use content-based deduplication, you'd have to give up limiting the number of hits per host. Is this correct, or did I miss something? That's correct. That's what's currently implemented. Doug
Re: svn commit: r394228 - in /lucene/nutch/trunk: ./ src/java/org/apache/nutch/plugin/ src/plugin/ src/plugin/analysis-de/ src/plugin/analysis-fr/ src/plugin/clustering-carrot2/ src/plugin/creativecom
[EMAIL PROTECTED] wrote: +!-- Copy the plugin.dtd file to the plugin doc-files dir -- +copy file=${plugins.dir}/plugin.dtd + todir=${src.dir}/org/apache/nutch/plugin/doc-files/ The build should not make changes to the source tree. The source tree should be read-only to the build. All changes during build should be confined to the build directory. Is this just needed for references from javadoc? If so, then this can be copied to build/docs, no? Doug
[jira] Commented: (NUTCH-246) segment size is never as big as topN or crawlDB size in a distributed deployment
[ http://issues.apache.org/jira/browse/NUTCH-246?page=comments#action_12374272 ] Doug Cutting commented on NUTCH-246: It seems like the Injector should be loading the current time from a job configuration property in the same way that the Generator is doing [...] That sounds like a good plan. Will you construct a patch for this? segment size is never as big as topN or crawlDB size in a distributed deployment - Key: NUTCH-246 URL: http://issues.apache.org/jira/browse/NUTCH-246 Project: Nutch Type: Bug Versions: 0.8-dev Reporter: Stefan Groschupf Priority: Minor Fix For: 0.8-dev I didn't reopen NUTCH-136 since it may be related to the hadoop split. I tested this on two different deployments (with 10 tasktrackers + 1 jobtracker, and 9 tasktrackers and 1 jobtracker). Defining the map and reduce task number in a mapred-default.xml does not solve the problem. (it is in nutch/conf on all boxes) We verified that it is not a problem of maximum urls per host and also not a problem of the url filter. Looks like the first job of the Generator (Selector) already got too few entries to process. Maybe this is somehow related to split generation or configuration inside the distributed jobtracker since it runs in a different jvm than the jobclient. However we were not able to find the source for this problem. I think that should be fixed before publishing a nutch 0.8. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
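The suggestion amounts to stamping one timestamp into the job configuration and reading it back in every task, so all tasks agree on "now". A sketch -- the property name is made up for illustration, not an existing Nutch key:

    import org.apache.hadoop.mapred.JobConf;

    public class InjectTimeSketch {
      public static void setCurrentTime(JobConf job) {
        job.set("db.injector.current.time",
                Long.toString(System.currentTimeMillis()));
      }

      /** Called from a task's configure(): every task then sees the same value. */
      public static long getCurrentTime(JobConf job) {
        return job.getLong("db.injector.current.time",
                           System.currentTimeMillis());
      }
    }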
Re: 0.8 release schedule (was Re: latest build throws error - critical)
Chris Mattmann wrote: +1 for a release sooner rather than later. I think this is a good plan. There's no reason we can't do another release in a month. If it is back-compatible we can call it 0.8.x, and if it's incompatible we can call it 0.9.0. I'm going to make a Hadoop 0.1.1 release today that can be included in Nutch 0.8.0. (With Hadoop we're going to aim for monthly releases, with potential bugfix releases between when serious bugs are found. The big bug in Hadoop 0.1.0 is http://issues.apache.org/jira/browse/HADOOP-117.) So we could aim for a Nutch 0.8.0 release sometime next week. Does that work for folks? Piotr, would you like to make this release, or should I? Doug
Re: CrawlDbReducer - selecting data for DB update
Andrzej Bialecki wrote: This selection is primarily made in the while() loop in CrawlDbReducer:45. My main objection is that selecting the highest value (meaning most recent) relies on the fact that values of status codes in CrawlDatum are ordered according to their meaning, and they are treated as a sort of state machine. Yes, that was the design, that status codes are also priorities. However, adding new states is very difficult, if they should have values lower than STATUS_FETCH_GONE, as it leads to breaking backwards-compatibility with older segment data. We can use CrawlDatum.VERSION to insert new status codes back-compatibly. Perhaps we should change the codes from [0, 1, 2, ...] to [0, 10, 20, 30, ...], so that we can more easily introduce new values? To update status codes from older versions we simply multiply by 10. Would something like that work? Or we could have a separate table mapping status codes to priority. Doug
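A sketch of that back-compatibility idea when reading old segment data: densely numbered codes written by earlier versions are simply scaled up to the spaced-out values (the version constant here is hypothetical):

    import java.io.DataInput;
    import java.io.IOException;

    public class StatusUpgradeSketch {
      private static final byte STATUS_SPACING_VERSION = 5;   // hypothetical

      /** Map old status codes (0, 1, 2, ...) onto spaced ones (0, 10, 20, ...). */
      static byte readStatus(byte version, DataInput in) throws IOException {
        byte status = in.readByte();
        if (version < STATUS_SPACING_VERSION) {
          status = (byte) (status * 10);
        }
        return status;
      }
    }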
Re: PMD integration
Piotr Kosiorowski wrote: I will make it a totally separate target (so tests do not depend on it). That was actually Doug's idea (and I agree with it) to stop the build file if PMD complains about something. It's similar to testing -- if your tests fail, the entire build fails. I totally agree with it - but I want to switch it on for others to play with first, and when we agree on the rules we want to use, make it obligatory. So we start out committing it as an independent target, and then add it to the test target? Is that the plan? If so, +1. Doug
Re: web ui improvement
Sami Siren wrote: I know there are people who think that a plain xml interface is good enough for all, but I would like to give this new architecture a try. I think this would be a great addition. The XML has a lot of uses, but we should include a good native, extensible, skinnable search UI. +1 As part of the required functionality of the 0.8 release discussion on some other thread, my opinion is to postpone any new ui functionality (for example NUTCH-48) until the new architecture is in place. I would not veto someone testing & committing NUTCH-48. We should avoid investing too much effort into this if it will soon be obsolete. But if a small effort will give folks "did you mean" support in the interim, that's not a bad thing. Of course, folks can always apply this patch themselves... Doug
0.8 release schedule (was Re: latest build throws error - critical)
TDLN wrote: I mean, how do others keep up to date with the main codeline? Do you advise updating every day? Should we make a 0.8.0 release soon? What features are still missing that we'd like to get into this release? Doug
Re: Search quality evaluation
FYI, Mike wrote some evaluation stuff for Nutch a long time ago. I found it in the Sourceforge Attic: http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/java/net/nutch/quality/Attic/ This worked by querying a set of search engines, those in: http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/engines/ The results of each engine are scored by how much they differ from all of the other engines combined. The Kendall Tau distance is used to compare rankings. Thus this is a good tool to find out how close Nutch is to the quality of other engines, but it may not be a good tool to make Nutch better than other search engines. In any case, it includes a system to scrape search results from other engines, based on Apple's Sherlock search-engine descriptors. These descriptors are also used by Mozilla: http://mycroft.mozdev.org/deepdocs/quickstart.html So there's a ready supply of up-to-date descriptions for most major search engines. Many engines provide a skin specifically to simplify parsing by these plugins. The code that implemented Sherlock plugins in Nutch is at: http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/java/net/nutch/quality/dynamic/ Doug Andrzej Bialecki wrote: Hi, I found this paper, more or less by accident: Scaling IR-System Evaluation using Term Relevance Sets; Einat Amitay, David Carmel, Ronny Lempel, Aya Soffer http://einat.webir.org/SIGIR_2004_Trels_p10-amitay.pdf It gives an interesting and rather simple framework for evaluating the quality of search results. Anybody interested in hacking together a component for Nutch and e.g. for Google, to run this evaluation? ;)
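For readers unfamiliar with the Kendall Tau distance mentioned above, a tiny example: comparing the rankings (A, B, C) and (B, A, C), exactly one of the three possible result pairs -- (A, B) -- is ordered differently, so the normalized distance is 1/3; identical rankings score 0 and fully reversed rankings score 1.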
Re: Add .settings to svn:ignore on root Nutch folder?
Other options (raised on the Hadoop list) are Checkstyle: http://checkstyle.sourceforge.net/ and FindBugs: http://findbugs.sourceforge.net/ Although these are both under LGPL and thus harder to include in Apache projects. Anything that generates a lot of false positives is bad: it either causes us to skip analysis of lots of files, or ignore the warnings. Skipping the JavaCC-generated classes is reasonable, but I'm wary of skipping much else. Sigh. Doug Dawid Weiss wrote: Ok, PMD seems like a good idea. I've added it to the build file. Unused code detection shows a few catches (javacc-generated classes need to be ignored because they contain a lot of junk), but unfortunately it also displays false positives such as in: MapWritable.java 429 {Avoid unused private fields such as 'fKeyClassId'} This field is private but is used in an outside class (through a synthetic accessor I presume, so a simple syntax tree analysis PMD does is insufficient to catch it). These things would need to be marked in the code as ignorable... Do you want me to create a JIRA issue for this, Doug? Or should we drop the subject? Oh, I forgot to say this: PMD's jars add a minimum of 1MB to the codebase (Xerces can be reused). D.
[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin
[ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372979 ] Doug Cutting commented on NUTCH-240: +1 for committing Generator.patch.txt now. 0 for committing the rest until I've had more time to think about it. I'm not against it, but, at a glance, I'm still hopeful we can do better. Scoring API: extension point, scoring filters and an OPIC plugin Key: NUTCH-240 URL: http://issues.apache.org/jira/browse/NUTCH-240 Project: Nutch Type: Improvement Versions: 0.8-dev Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Attachments: Generator.patch.txt, patch.txt, patch1.txt This patch refactors all places where Nutch manipulates page scores, into a plugin-based API. Using this API it's possible to implement different scoring algorithms. It is also much easier to understand how scoring works. Multiple scoring plugins can be run in sequence, in a manner similar to URLFilters. Included is also an OPICScoringFilter plugin, which contains the current implementation of the scoring algorithm. Together with the scoring API it provides a fully backward-compatible scoring. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin
[ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372981 ] Doug Cutting commented on NUTCH-240: Also, note that we can now extend Hadoop's new MapReduceBase to implement configure() and close() for many Mappers and Reducers, including the ones in this patch. Scoring API: extension point, scoring filters and an OPIC plugin Key: NUTCH-240 URL: http://issues.apache.org/jira/browse/NUTCH-240 Project: Nutch Type: Improvement Versions: 0.8-dev Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Attachments: Generator.patch.txt, patch.txt, patch1.txt This patch refactors all places where Nutch manipulates page scores, into a plugin-based API. Using this API it's possible to implement different scoring algorithms. It is also much easier to understand how scoring works. Multiple scoring plugins can be run in sequence, in a manner similar to URLFilters. Included is also an OPICScoringFilter plugin, which contains the current implementation of the scoring algorithm. Together with the scoring API it provides a fully backward-compatible scoring. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: Refactoring some plugins
Jérôme Charron wrote: One more question about javadoc (I hope the last one): Do you think it makes sense to split the plugins gathered into the Misc group into many plugins (such as index-more / query-more), so that each sub-plugin can be dispatched into the proper group? No, I don't think so. These are strongly related bundles of plugins. When you change one, chances are good you'll change the others, so it makes sense to keep their code together rather than split it up. Folks can still find all implementations of an interface in the javadoc, just not always grouped together in the table of contents. We could, instead of calling these "misc", call them "compound" plugins or something. We can change the package.html for each to list the coordinated set of plugins they provide. For example, language-identifier's could say something like, "Includes parse, index and query plugins to identify, index and make searchable the identified language." Doug