[jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility
[ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542819 ] Renaud Richardet commented on NUTCH-444:

hi, i am travelling and will be offline until january 2008. thanks for your patience. Renaud

-- renaudatoslutionsdotcom www.oslutions.com

Possibly use a different library to parse RSS feed for improved performance and compatibility

Key: NUTCH-444
URL: https://issues.apache.org/jira/browse/NUTCH-444
Project: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions: 0.9.0
Reporter: Renaud Richardet
Assignee: Chris A. Mattmann
Priority: Minor
Fix For: 1.0.0
Attachments: feed.tar.bz2, NUTCH-444.1-1.patch, NUTCH-444.Mattmann.061707.patch.txt, NUTCH-444.patch, parse-feed-v2.tar.bz2, parse-feed.tar.bz2

As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current library (feedparser) has the following issues:
- OutOfMemory errors when parsing 100k feeds, since it has to convert the feed to JDOM first
- no support for Atom 1.0
- no development in the last year

Alternatives are:
- Rome
- Informa
- a custom implementation based on StAX
- ??

-- This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-540) some problem about the Nutch cache
[ https://issues.apache.org/jira/browse/NUTCH-540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Renaud Richardet updated NUTCH-540:

Priority: Major (was: Blocker)

could you please attach log files and error messages? thanks

some problem about the Nutch cache

Key: NUTCH-540
URL: https://issues.apache.org/jira/browse/NUTCH-540
Project: Nutch
Issue Type: Bug
Components: searcher
Affects Versions: 0.9.0
Environment: Red Hat AS4 + Tomcat 5.5 + Nutch 0.9
Reporter: crossany
Fix For: 0.9.0

I am Chinese, and I tested searching for Chinese words in Nutch. I installed Nutch 0.9 in Tomcat 5 on Linux; the Tomcat charset is UTF-8. I used Nutch to crawl a Chinese website whose charset is also UTF-8. When I search for a Chinese word in Nutch on Tomcat, the result's title and description display correctly, but when I click the cached page, it displays with the wrong character encoding, even though the cached page's charset is also UTF-8. I found a website that uses Nutch, http://www.synoo.com:8080/zh/, and searching for Chinese words there shows the same error. When I use Luke to inspect the segments, they display Chinese words correctly, so I think this may be a bug.

-- This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-369) StringUtil.resolveEncodingAlias is unuseful.
[ https://issues.apache.org/jira/browse/NUTCH-369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Renaud Richardet updated NUTCH-369:

Attachment: patch.diff

unified diff against head.
- fixes encoding, as described by King Kong
- removes non-valid features
- fixes logging

StringUtil.resolveEncodingAlias is unuseful.

Key: NUTCH-369
URL: https://issues.apache.org/jira/browse/NUTCH-369
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 0.8
Environment: all
Reporter: King Kong
Attachments: patch.diff

Even after we define the encoding alias map in StringUtil, HTML parsing still uses the original encoding. I found that NekoHTML, which HtmlParser uses, reads the charset from the meta tag. We can set its feature "http://cyberneko.org/html/features/scanner/ignore-specified-charset" to true so that NekoHTML uses the encoding we set. Concretely:

private DocumentFragment parseNeko(InputSource input) throws Exception {
  DOMFragmentParser parser = new DOMFragmentParser();
  // some plugins, e.g., creativecommons, need to examine html comments
+ parser.setFeature("http://cyberneko.org/html/features/scanner/ignore-specified-charset", true);
  try {
    parser.setFeature("http://apache.org/xml/features/include-comments", true);

BTW, it must be added in front of the try block, because the following statement (parser.setFeature("http://apache.org/xml/features/include-comments", true);) can throw an exception.

-- This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-369) StringUtil.resolveEncodingAlias is unuseful.
[ https://issues.apache.org/jira/browse/NUTCH-369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Renaud Richardet updated NUTCH-369:

Attachment: remover.diff

just FYI, you can further filter which elements Neko should keep or remove. see the patch for an example, and http://people.apache.org/~andyc/neko/doc/html/settings.html

StringUtil.resolveEncodingAlias is unuseful.

Key: NUTCH-369
URL: https://issues.apache.org/jira/browse/NUTCH-369
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 0.9.0
Environment: all
Reporter: King Kong
Priority: Minor
Attachments: patch.diff, remover.diff

Even after we define the encoding alias map in StringUtil, HTML parsing still uses the original encoding. I found that NekoHTML, which HtmlParser uses, reads the charset from the meta tag. We can set its feature "http://cyberneko.org/html/features/scanner/ignore-specified-charset" to true so that NekoHTML uses the encoding we set. Concretely:

private DocumentFragment parseNeko(InputSource input) throws Exception {
  DOMFragmentParser parser = new DOMFragmentParser();
  // some plugins, e.g., creativecommons, need to examine html comments
+ parser.setFeature("http://cyberneko.org/html/features/scanner/ignore-specified-charset", true);
  try {
    parser.setFeature("http://apache.org/xml/features/include-comments", true);

BTW, it must be added in front of the try block, because the following statement (parser.setFeature("http://apache.org/xml/features/include-comments", true);) can throw an exception.

-- This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472733 ] Renaud Richardet commented on NUTCH-443:

hi All, Glad to see that this patch is moving forward :-) I have been carried away by a project, but will have some time this week. Please let me know if there is anything I can help with, especially on the parsers. Cheers, Renaud

allow parsers to return multiple Parse object, this will speed up the rss parser

Key: NUTCH-443
URL: https://issues.apache.org/jira/browse/NUTCH-443
Project: Nutch
Issue Type: New Feature
Components: fetcher
Affects Versions: 0.9.0
Reporter: Renaud Richardet
Assigned To: Chris A. Mattmann
Priority: Minor
Fix For: 0.9.0
Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, NUTCH-443-draft-v6.patch, parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff

Allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser can return multiple Parse objects, which will all be indexed separately. Advantage: no need to fetch all feed items separately. See the discussion at http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html

-- This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Renaud Richardet updated NUTCH-443:

Attachment: NUTCH-443-draft-v4.patch

Hi Dogacan, Thanks for merging the patches, good teamwork! I worked on the RSS parser; it should now basically work. All core and plugin tests pass, except for TestRSSparser; I will work on that. Once this is in place, I will have a look at the other issues with fetch time, etc. I merged my changes with your patch, version 3.

allow parsers to return multiple Parse object, this will speed up the rss parser

Key: NUTCH-443
URL: https://issues.apache.org/jira/browse/NUTCH-443
Project: Nutch
Issue Type: New Feature
Components: fetcher
Affects Versions: 0.9.0
Reporter: Renaud Richardet
Priority: Minor
Fix For: 0.9.0
Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff

Allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser can return multiple Parse objects, which will all be indexed separately. Advantage: no need to fetch all feed items separately. See the discussion at http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html

-- This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471878 ] Renaud Richardet commented on NUTCH-443:

Nutch Newbie, Gal, Chris: It's great that you are discussing alternative RSS parsing libraries, but the resolution of this issue does not depend on which underlying RSS library is used in RSSParser. Would you mind moving the conversation to the new issue I created for it (NUTCH-444)? Thanks a bunch.

allow parsers to return multiple Parse object, this will speed up the rss parser

Key: NUTCH-443
URL: https://issues.apache.org/jira/browse/NUTCH-443
Project: Nutch
Issue Type: New Feature
Components: fetcher
Affects Versions: 0.9.0
Reporter: Renaud Richardet
Priority: Minor
Fix For: 0.9.0
Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff

Allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser can return multiple Parse objects, which will all be indexed separately. Advantage: no need to fetch all feed items separately. See the discussion at http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html

-- This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Renaud Richardet updated NUTCH-443:

Attachment: parsers.diff

Great, here's my work in progress (not finished, not tested) for updating all parsers and extending the RSS parser.

allow parsers to return multiple Parse object, this will speed up the rss parser

Key: NUTCH-443
URL: https://issues.apache.org/jira/browse/NUTCH-443
Project: Nutch
Issue Type: New Feature
Components: fetcher
Affects Versions: 0.9.0
Reporter: Renaud Richardet
Priority: Minor
Fix For: 0.9.0
Attachments: parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff

Allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser can return multiple Parse objects, which will all be indexed separately. Advantage: no need to fetch all feed items separately. See the discussion at http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html

-- This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
Re: FW: RSS-fecter and index individul-how can i realize this function
HUYLEBROECK Jeremy RD-ILAB-SSF wrote: I send this message again, as it apparently didn't go through. (I am messing up my email addresses on the mailing list...)

-----Original Message-----
Sent: Friday, February 02, 2007 10:29 AM

Using Nutch 0.8, we modified the code starting at the fetching/parsing steps and the following. We have a different implementation of the Parse object and OutputFormat, including an additional list of ParseData objects saved in an additional subfolder in the DFS. We changed the indexing step a lot too, so we don't use the Nutch code there.

Is your implementation similar to what we started at https://issues.apache.org/jira/browse/NUTCH-443? If you think some of your changes could be integrated, please post a patch there. Thanks for sharing, Renaud

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Friday, February 02, 2007 10:19 AM
To: nutch-dev@lucene.apache.org
Subject: Re: RSS-fecter and index individul-how can i realize this function

Gal Nitzan wrote: IMHO the data that is needed, i.e. the data that will be fetched in the next fetch process, is already available in the item element. Each item element represents one web resource, and there is no reason to go to the server and re-fetch that resource.

Perhaps ProtocolOutput should change. The method:

Content getContent();

could be deprecated and replaced with:

Content[] getContents();

This would require changes to the indexing pipeline. I can't think of any severe complications, but I haven't looked closely. Could something like that work?
Doug -- Renaud Richardet +1 617 230 9112 my email is my first name at apache.org http://www.oslutions.com
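The ProtocolOutput change Doug proposes above can be illustrated with simplified stand-ins; Content and ProtocolOutput below are toy versions for illustration only, not the real Nutch classes:

```java
// Simplified stand-ins; the real Nutch classes carry much more state.
class Content {
    final String url;
    Content(String url) { this.url = url; }
}

class ProtocolOutput {
    private final Content[] contents;
    ProtocolOutput(Content... contents) { this.contents = contents; }

    /** @deprecated kept for old callers; returns the first content only. */
    @Deprecated
    Content getContent() { return contents.length > 0 ? contents[0] : null; }

    // New accessor: a fetched RSS feed can expose each item as its own Content.
    Content[] getContents() { return contents; }
}

public class Main {
    public static void main(String[] args) {
        ProtocolOutput out = new ProtocolOutput(
                new Content("http://example.org/item1"),
                new Content("http://example.org/item2"));
        System.out.println(out.getContents().length); // 2
        System.out.println(out.getContent().url);     // http://example.org/item1
    }
}
```

Old callers keep working through the deprecated single-content accessor while the indexing pipeline migrates to the array form.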
[jira] Created: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
allow parsers to return multiple Parse object, this will speed up the rss parser

Key: NUTCH-443
URL: https://issues.apache.org/jira/browse/NUTCH-443
Project: Nutch
Issue Type: New Feature
Components: fetcher
Affects Versions: 0.9.0
Reporter: Renaud Richardet
Priority: Minor
Fix For: 0.9.0

Allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser can return multiple Parse objects, which will all be indexed separately. Advantage: no need to fetch all feed items separately. See the discussion at http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html

-- This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
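The proposed Parser#parse signature can be sketched with toy stand-ins; Parse, Parser, and ToyFeedParser below are simplified illustrations, not the actual Nutch classes:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-in for Nutch's Parse type, for illustration only.
class Parse {
    final String text;
    Parse(String text) { this.text = text; }
}

interface Parser {
    // Proposed shape: key each Parse by the URL it should be indexed under,
    // so one fetched RSS document can yield many indexable documents.
    Map<String, Parse> parse(String url, String content);
}

class ToyFeedParser implements Parser {
    public Map<String, Parse> parse(String url, String content) {
        Map<String, Parse> parses = new LinkedHashMap<String, Parse>();
        // Pretend each line of the feed is one item, encoded as "link|text".
        for (String line : content.split("\n")) {
            String[] parts = line.split("\\|", 2);
            parses.put(parts[0], new Parse(parts[1]));
        }
        return parses;
    }
}

public class Main {
    public static void main(String[] args) {
        Parser p = new ToyFeedParser();
        Map<String, Parse> out = p.parse("http://example.org/feed",
                "http://example.org/a|first entry\nhttp://example.org/b|second entry");
        // One fetch, two separately indexable documents:
        System.out.println(out.size());                           // 2
        System.out.println(out.get("http://example.org/a").text); // first entry
    }
}
```

The map key doubles as the index URL of each item, which is why the feed items never need to be fetched individually.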
Re: RSS-fecter and index individul-how can i realize this function
Doug Cutting wrote:
> Renaud Richardet wrote:
>> I see. I was thinking that I could index the feed items without having to fetch them individually.
> Okay, so if Parser#parse returned a Map<String,Parse>, then the URL for each parse should be that of its link, since you don't want to fetch that separately. Right?

Exactly.

> So now the question is, how much impact would this change to the Parser API have on the rest of Nutch? It would require changes to all Parser implementations, to ParseSegment, to ParseUtil, and to Fetcher. But, as far as I can tell, most of these changes look straightforward.

I think so, too. I have opened an issue in JIRA (https://issues.apache.org/jira/browse/NUTCH-443) and will give it a try. Doğacan, have you started working on it yet?

Thanks, Renaud
Re: RSS-fecter and index individul-how can i realize this function
Hi Chris, Doug,

Chris Mattmann wrote:
> Hi Doug,
>> Since the target of the link must still be indexed separately from the item itself, how much use is all this? If the RSS document is considered a single page that changes frequently, and items' links are considered ordinary outlinks, isn't much the same effect achieved?
> IMHO, yes. That's why it has been hard for me to understand the real use case for what Gal et al. are talking about. I've been trying to wrap my head around it, but it seems to me the capability they require is sort of already provided...

Not sure I understand: an RSS feed is a collection of feed entries, and each feed entry would be indexed as a separate document (each feed entry has a URL or UUID as a unique identifier). What happens with the RSS feed itself? Is it indexed, or considered a container that just needs to be fetched again and again for new entries? The use case is that you index RSS feeds, but your users can search each feed entry as a single document. Does it make sense?

Thanks, Renaud

> Cheers, Chris
> Doug

__ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.

-- Renaud Richardet +1 617 230 9112 my email is my first name at apache.org http://www.oslutions.com
Re: api.RegexURLFilterBase - Configuration Resources
Tobias Zahn wrote:
> Hello! I have written a new plugin extending the IndexingFilter and using the RegexURLFilterBase class. In the log there is this message: FATAL api.RegexURLFilterBase - Can't find resource: null

In your new class CustomIndexingFilter, create a field Configuration conf, and implement setConf and getConf like this:

public void setConf(Configuration conf) {
  this.conf = conf;
}

public Configuration getConf() {
  return this.conf;
}

and pass the conf object to RegexURLFilterBase before calling it:

RegexURLFilterBase r = new RegexURLFilter();
r.setConf(conf);
r.filter(sometext);

This should do the trick. I assume you have set up the build configuration of your plugin correctly; it was tricky for me ;-)

build.xml:

<!-- Build compilation dependencies -->
<target name="deps-jar">
  ..
  <ant target="jar" inheritall="false" dir="../urlfilter-regex"/>
  <ant target="jar" inheritall="false" dir="../lib-regex-filter"/>
</target>

<!-- Add compilation dependencies to classpath -->
<path id="plugin.deps">
  <fileset dir="${nutch.root}/build">
    ...
    <include name="**/urlfilter-regex/*.jar"/>
    <include name="**/lib-regex-filter/*.jar"/>
  </fileset>
</path>

and plugin.xml:

<requires>
  <import plugin="nutch-extensionpoints"/>
  ..
  <import plugin="urlfilter-regex"/>
  <import plugin="lib-regex-filter"/>
</requires>

HTH, Renaud

> I don't know how to handle that Configuration-Objects (setConf() etc.) What should I do to avoid that error? Where does the Configuration-Object come from? TIA Tobias Zahn

-- Renaud Richardet +1 617 230 9112 my email is my first name at apache.org http://www.oslutions.com
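For a self-contained picture of the setConf/getConf wiring described above, here is a minimal stand-alone sketch. Configuration, RegexURLFilter, and the "urlfilter.prefix" key are toy stand-ins for the Hadoop/Nutch classes, and the "Can't find resource: null" failure is merely simulated:

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for org.apache.hadoop.conf.Configuration.
class Configuration {
    private final Map<String, String> props = new HashMap<String, String>();
    void set(String k, String v) { props.put(k, v); }
    String get(String k) { return props.get(k); }
}

// Toy stand-in for the filter: it refuses to work without a Configuration,
// mimicking the "Can't find resource: null" failure mode from the log.
class RegexURLFilter {
    private Configuration conf;
    void setConf(Configuration conf) { this.conf = conf; }
    Configuration getConf() { return this.conf; }

    String filter(String url) {
        if (conf == null) throw new IllegalStateException("Can't find resource: null");
        // Toy rule standing in for the real regex-file lookup:
        return url.startsWith(conf.get("urlfilter.prefix")) ? url : null;
    }
}

public class Main {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("urlfilter.prefix", "http://");

        RegexURLFilter r = new RegexURLFilter();
        r.setConf(conf); // without this line, filter() fails as in the log
        System.out.println(r.filter("http://example.org/")); // http://example.org/
        System.out.println(r.filter("ftp://example.org/"));  // null
    }
}
```

The point is only the wiring: the filter reads everything it needs from the Configuration handed to it via setConf, so forgetting that call leaves it with nothing to resolve.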
Re: RSS-fecter and index individul-how can i realize this function
Doug Cutting wrote:
> Renaud Richardet wrote:
>> The usecase is that you index RSS-feeds, but your users can search each feed-entry as a single document. Does it make sense?
> But each feed item also contains a link whose content will be indexed, and that's generally a superset of the item.

Agreed

> So should there be two urls indexed per item?

I don't think so

> In many cases, the best thing to do is to index only the linked page, not the feed item at all. In some (rare?) cases, there might be items without a link, whose only content is directly in the feed, or where the content in the feed is complementary to that in the linked page. In these cases it might be useful to combine the two (the feed item and the linked content), indexing both. The proposed change might permit that. Is that the case you're concerned about?

I see. I was thinking that I could index the feed items without having to fetch them individually. More fundamentally, I want to index only the blog-entry text, and not the elements around it (header, menus, ads, ...), so as to improve the search results. Here's my case; the proposed changes would allow me to do (*):

1) parse feeds:
for each (feedentry : feed) do
  if (full-text entries) then
    index each feed entry as a single document; blog header, menus are not indexed. *
  else
    create a special outlink for each feed entry, which includes metadata (content, time, etc.)
  endif
done

2) on a next fetch loop:
for each (link) do
  if (this is a normal link) then
    fetch it and index it normally
  else if (this link comes from an already indexed feed entry) then
    end, do not fetch it *
  else if (this is a special outlink) then
    guess which DOM nodes hold the post content
    index it; blog header, menus are not indexed.
  endif
done

Thanks, Renaud
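The second pass of the logic above can be expressed as plain Java; Link, its fields, and the dispatch() outcomes are illustrative names invented for this sketch, not Nutch APIs:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative link record: a "special" feed-entry outlink carries its
// content inline; an entry already indexed from the feed is marked as such.
class Link {
    final String url;
    final String inlineContent;   // non-null for special feed-entry outlinks
    final boolean alreadyIndexed; // true if indexed directly from the feed
    Link(String url, String inlineContent, boolean alreadyIndexed) {
        this.url = url;
        this.inlineContent = inlineContent;
        this.alreadyIndexed = alreadyIndexed;
    }
}

public class Main {
    // Second pass: decide what to do with each outlink from the first pass.
    static String dispatch(Link link) {
        if (link.alreadyIndexed) return "skip";               // full-text entry, already indexed
        if (link.inlineContent != null) return "index-entry"; // use feed metadata, guess content nodes
        return "fetch";                                       // normal link: fetch and index normally
    }

    public static void main(String[] args) {
        List<Link> links = new ArrayList<Link>();
        links.add(new Link("http://example.org/page",  null,         false));
        links.add(new Link("http://example.org/post1", null,         true));
        links.add(new Link("http://example.org/post2", "summary...", false));
        for (Link l : links) System.out.println(l.url + " -> " + dispatch(l));
    }
}
```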
Re: RSS-fecter and index individul-how can i realize this function
not reflect those of either NASA, JPL, or the California Institute of Technology. -- renaud richardet +1 617 230 9112 renaud at oslutions.com http://www.oslutions.com
[jira] Updated: (NUTCH-412) plugin to parse the feed-url (rss/atom) of a blog
[ http://issues.apache.org/jira/browse/NUTCH-412?page=all ] Renaud Richardet updated NUTCH-412:

Attachment: plugin_parse-feedUrl2.diff

plugin to parse the feed-url (rss/atom) of a blog

Key: NUTCH-412
URL: http://issues.apache.org/jira/browse/NUTCH-412
Project: Nutch
Issue Type: New Feature
Affects Versions: 0.9.0
Reporter: Renaud Richardet
Priority: Minor
Attachments: plugin_parse-feedUrl.diff, plugin_parse-feedUrl2.diff

A plugin that extracts the feed URL (rss/atom) of a blog by retrieving the href from the <link> element in the <head> (if found), and stores it in metadata. The meta can be accessed with parse.getData().getMeta("feedUrl"); you can test this plugin with the main method of HtmlParser. Thanks for any feedback.

-- This message is automatically generated by JIRA. If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Created: (NUTCH-412) plugin to parse the feed-url (rss/atom) of a blog
plugin to parse the feed-url (rss/atom) of a blog

Key: NUTCH-412
URL: http://issues.apache.org/jira/browse/NUTCH-412
Project: Nutch
Issue Type: New Feature
Affects Versions: 0.9.0
Reporter: Renaud Richardet
Priority: Minor

A plugin that extracts the feed URL (rss/atom) of a blog by retrieving the href from the <link> element in the <head> (if found), and stores it in metadata. The meta can be accessed with parse.getData().getMeta("feedUrl"); you can test this plugin with the main method of HtmlParser. Thanks for any feedback.

-- This message is automatically generated by JIRA. If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (NUTCH-412) plugin to parse the feed-url (rss/atom) of a blog
[ http://issues.apache.org/jira/browse/NUTCH-412?page=all ] Renaud Richardet updated NUTCH-412:

Attachment: plugin_parse-feedUrl.diff

unified diff against head (Rev: 481445)

plugin to parse the feed-url (rss/atom) of a blog

Key: NUTCH-412
URL: http://issues.apache.org/jira/browse/NUTCH-412
Project: Nutch
Issue Type: New Feature
Affects Versions: 0.9.0
Reporter: Renaud Richardet
Priority: Minor
Attachments: plugin_parse-feedUrl.diff

A plugin that extracts the feed URL (rss/atom) of a blog by retrieving the href from the <link> element in the <head> (if found), and stores it in metadata. The meta can be accessed with parse.getData().getMeta("feedUrl"); you can test this plugin with the main method of HtmlParser. Thanks for any feedback.

-- This message is automatically generated by JIRA. If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Created: (NUTCH-359) extraction of links will fail for whole page if one single link cannot be parsed
extraction of links will fail for whole page if one single link cannot be parsed Key: NUTCH-359 URL: http://issues.apache.org/jira/browse/NUTCH-359 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 0.8 Environment: Ubuntu Dapper Reporter: Renaud Richardet Priority: Minor Attachments: outlink.diff When Nutch parses the outlinks of a fetched page, the process will fail if a single link cannot be parsed (e.g. java.net.MalformedURLException: unknown protocol). The attached patch will keep indexing the remaining links on that page even if one fails. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: [Fwd: Re: [Nutch Wiki] Update of RenaudRichardet by RenaudRichardet]
Thank you Stefan,

I think you meant editing http://wiki.apache.org/nutch/RunNutchInEclipse, not http://wiki.apache.org/nutch/RenaudRichardet, right? Yes, it's definitely OK with me; that's why I put it in the Wiki. Please keep adding your improvements! Thanks again, Renaud

-------- Original Message --------
Subject: Re: [Nutch Wiki] Update of RenaudRichardet by RenaudRichardet
Date: Wed, 23 Aug 2006 12:00:22 -0700
From: Stefan Groschupf [EMAIL PROTECTED]
Reply-To: nutch-dev@lucene.apache.org
To: nutch-dev@lucene.apache.org
CC: [EMAIL PROTECTED]
References: [EMAIL PROTECTED]

Hi Renaud, I updated your page with some more details, I hope that is ok for you. Thanks for creating it. Stefan

On 23.08.2006 at 11:51, Apache Wiki wrote:
> Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by RenaudRichardet: http://wiki.apache.org/nutch/RenaudRichardet
> New page: {{{ Renaud Richardet COO America Wyona Inc. - Open Source Content Management - Apache Lenya office +1 857 776-3195 mobile +1 617 230 9112 renaud.richardet at wyona.com http://www.wyona.com }}}

-- Renaud Richardet COO America Wyona - Open Source Content Management - Apache Lenya office +1 857 776-3195 mobile +1 617 230 9112 renaud.richardet at wyona.com http://www.wyona.com
[jira] Updated: (NUTCH-346) Improve readability of logs/hadoop.log
[ http://issues.apache.org/jira/browse/NUTCH-346?page=all ] Renaud Richardet updated NUTCH-346: --- Attachment: log4j_plugins.diff OK, here we go. This patch should be good for 0.8 and trunk. Improve readability of logs/hadoop.log -- Key: NUTCH-346 URL: http://issues.apache.org/jira/browse/NUTCH-346 Project: Nutch Issue Type: Improvement Affects Versions: 0.9.0 Environment: ubuntu dapper Reporter: Renaud Richardet Priority: Minor Attachments: log4j_plugins.diff adding log4j.logger.org.apache.nutch.plugin.PluginRepository=WARN to conf/log4j.properties dramatically improves the readability of the logs in logs/hadoop.log (removes all INFO) -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Created: (NUTCH-346) Improve readability of logs/hadoop.log
Improve readability of logs/hadoop.log -- Key: NUTCH-346 URL: http://issues.apache.org/jira/browse/NUTCH-346 Project: Nutch Issue Type: Improvement Affects Versions: 0.9.0 Environment: ubuntu dapper Reporter: Renaud Richardet Priority: Minor adding log4j.logger.org.apache.nutch.plugin.PluginRepository=WARN to conf/log4j.properties dramatically improves the readability of the logs in logs/hadoop.log (removes all INFO) -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
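In context, the change is a one-line addition to conf/log4j.properties; the surrounding lines below are illustrative and may differ between Nutch releases:

```
# Existing default level (illustrative; your file's root logger line may differ)
log4j.rootLogger=INFO,DRFA

# Silence per-plugin INFO chatter from the plugin repository
log4j.logger.org.apache.nutch.plugin.PluginRepository=WARN
```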
[jira] Commented: (NUTCH-266) hadoop bug when doing updatedb
[ http://issues.apache.org/jira/browse/NUTCH-266?page=comments#action_12426579 ] Renaud Richardet commented on NUTCH-266: KuroSaka, yes you can download the hadoop jar, release 0.5.0 from the project website: http://lucene.apache.org/hadoop/ and http://www.apache.org/dyn/closer.cgi/lucene/hadoop/ hadoop bug when doing updatedb -- Key: NUTCH-266 URL: http://issues.apache.org/jira/browse/NUTCH-266 Project: Nutch Issue Type: Bug Affects Versions: 0.8 Environment: windows xp, JDK 1.4.2_04 Reporter: Eugen Kochuev Fix For: 0.9.0, 0.8.1 Attachments: patch.diff, patch_hadoop-0.5.0.diff I constantly get the following error message 060508 230637 Running job: job_pbhn3t 060508 230637 c:/nutch/crawl-20060508230625/crawldb/current/part-0/data:0+245 060508 230637 c:/nutch/crawl-20060508230625/segments/20060508230628/crawl_fetch/part-0/data:0+296 060508 230637 c:/nutch/crawl-20060508230625/segments/20060508230628/crawl_parse/part-0:0+5258 060508 230637 job_pbhn3t java.io.IOException: Target /tmp/hadoop/mapred/local/reduce_qnd5sx/map_qjp7tf.out already exists at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:162) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:62) at org.apache.hadoop.fs.LocalFileSystem.renameRaw(LocalFileSystem.java:191) at org.apache.hadoop.fs.FileSystem.rename(FileSystem.java:306) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:101) Exception in thread main java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341) at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:54) at org.apache.nutch.crawl.Crawl.main(Crawl.java:114) -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-330) command line tool to search a Lucene index
[ http://issues.apache.org/jira/browse/NUTCH-330?page=comments#action_12426629 ] Renaud Richardet commented on NUTCH-330:

This issue is obsolete: I just found out that Nutch already supports searching from the command line via bin/nutch org.apache.nutch.searcher.NutchBean [searchterm]. It assumes that you call it from the base of your crawl directory.

command line tool to search a Lucene index

Key: NUTCH-330
URL: http://issues.apache.org/jira/browse/NUTCH-330
Project: Nutch
Issue Type: Improvement
Components: searcher
Affects Versions: 0.8
Environment: ubuntu
Reporter: Renaud Richardet
Priority: Minor
Attachments: clSearch.diff, clSearch.diff

A tool to search a Lucene index from the command line; makes development and testing faster.

usage: bin/nutch searchindex [index dir] [searchkeyword]
example: bin/nutch searchindex crawl/index flowers

-- This message is automatically generated by JIRA. If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (NUTCH-266) hadoop bug when doing updatedb
[ http://issues.apache.org/jira/browse/NUTCH-266?page=all ] Renaud Richardet updated NUTCH-266: --- Attachment: patch_hadoop-0.5.0.diff Now that Hadoop 0.5 has been released, here's the patch to use hadoop-0.5.0.jar in Nutch-0.8.x HTH, Renaud hadoop bug when doing updatedb -- Key: NUTCH-266 URL: http://issues.apache.org/jira/browse/NUTCH-266 Project: Nutch Issue Type: Bug Affects Versions: 0.8 Environment: windows xp, JDK 1.4.2_04 Reporter: Eugen Kochuev Fix For: 0.9.0, 0.8.1 Attachments: patch.diff, patch_hadoop-0.5.0.diff I constantly get the following error message 060508 230637 Running job: job_pbhn3t 060508 230637 c:/nutch/crawl-20060508230625/crawldb/current/part-0/data:0+245 060508 230637 c:/nutch/crawl-20060508230625/segments/20060508230628/crawl_fetch/part-0/data:0+296 060508 230637 c:/nutch/crawl-20060508230625/segments/20060508230628/crawl_parse/part-0:0+5258 060508 230637 job_pbhn3t java.io.IOException: Target /tmp/hadoop/mapred/local/reduce_qnd5sx/map_qjp7tf.out already exists at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:162) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:62) at org.apache.hadoop.fs.LocalFileSystem.renameRaw(LocalFileSystem.java:191) at org.apache.hadoop.fs.FileSystem.rename(FileSystem.java:306) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:101) Exception in thread main java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341) at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:54) at org.apache.nutch.crawl.Crawl.main(Crawl.java:114) -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (NUTCH-266) hadoop bug when doing updatedb
[ http://issues.apache.org/jira/browse/NUTCH-266?page=all ]

Renaud Richardet updated NUTCH-266:
-----------------------------------

    Attachment: patch.diff

Thank you Sami. We had a similar problem with Win XP and were able to fix it by using hadoop-nightly.jar. However, because of some changes in Hadoop (http://issues.apache.org/jira/browse/HADOOP-252), Nutch would no longer compile. The attached patch solves this. Let us know if there is a better way.

hadoop bug when doing updatedb
------------------------------

                Key: NUTCH-266
                URL: http://issues.apache.org/jira/browse/NUTCH-266
            Project: Nutch
         Issue Type: Bug
   Affects Versions: 0.8
        Environment: windows xp, JDK 1.4.2_04
           Reporter: Eugen Kochuev
        Attachments: patch.diff

I constantly get the following error message:

060508 230637 Running job: job_pbhn3t
060508 230637 c:/nutch/crawl-20060508230625/crawldb/current/part-0/data:0+245
060508 230637 c:/nutch/crawl-20060508230625/segments/20060508230628/crawl_fetch/part-0/data:0+296
060508 230637 c:/nutch/crawl-20060508230625/segments/20060508230628/crawl_parse/part-0:0+5258
060508 230637 job_pbhn3t
java.io.IOException: Target /tmp/hadoop/mapred/local/reduce_qnd5sx/map_qjp7tf.out already exists
    at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:162)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:62)
    at org.apache.hadoop.fs.LocalFileSystem.renameRaw(LocalFileSystem.java:191)
    at org.apache.hadoop.fs.FileSystem.rename(FileSystem.java:306)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:101)
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
    at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:54)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:114)
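The stack trace above fails inside LocalFileSystem.rename because a target file left over from a previous attempt still exists. As a minimal sketch of the underlying fix idea (remove a stale target before renaming), here is a self-contained illustration using plain java.io.File rather than Hadoop's FileSystem API; the class and method names are hypothetical, not part of Hadoop or the attached patch:

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class SafeRename {
    // Move 'src' to 'dest', deleting a stale 'dest' first so the move
    // cannot fail with "Target ... already exists".
    public static void moveReplacing(File src, File dest) throws IOException {
        if (dest.exists() && !dest.delete()) {
            throw new IOException("Could not delete stale target " + dest);
        }
        if (!src.renameTo(dest)) {
            throw new IOException("Rename " + src + " -> " + dest + " failed");
        }
    }

    public static void main(String[] args) throws IOException {
        File src = File.createTempFile("map_", ".out");
        Files.write(src.toPath(), "data".getBytes());
        // dest already exists, mirroring the situation in the bug report
        File dest = File.createTempFile("reduce_", ".out");
        moveReplacing(src, dest);
        System.out.println(dest.exists() && !src.exists());
    }
}
```

Note that File.renameTo is platform-dependent (one reason the original failure shows up on Windows XP), which is why the sketch checks its return value instead of assuming success.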
[jira] Updated: (NUTCH-208) http: proxy exception list:
[ http://issues.apache.org/jira/browse/NUTCH-208?page=all ]

Renaud Richardet updated NUTCH-208:
-----------------------------------

    Attachment: proxy_exception_list-0.8.diff

I updated the patch to 0.8 and corrected a small typo (if (!"".equals(input[i].trim())) {). The proxy exception list feature works well. You can test it using any proxy, e.g. tinyproxy (http://wiki.apache.org/nutch/SetupProxyForNutch).

http: proxy exception list:
---------------------------

                Key: NUTCH-208
                URL: http://issues.apache.org/jira/browse/NUTCH-208
            Project: Nutch
         Issue Type: New Feature
         Components: fetcher
   Affects Versions: 0.8
           Reporter: Matthias Günter
           Priority: Minor
        Attachments: patch.txt, patch.txt, proxy_exception_list-0.8.diff

I suggest adding a parameter to nutch-default.xml that defines a proxy exception list:

<property>
  <name>http.proxy.exception.list</name>
  <value></value>
  <description>URLs and hosts that don't use the proxy (e.g. intranets)</description>
</property>

This is useful when scanning intranet/internet combinations from behind a firewall. A preliminary patch showing the changes is attached to this request; we will test it and update it if necessary. This also reflects the reality in web browsers, which in most cases provide such an exception list.
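To illustrate how such an exception list might be consumed, here is a minimal sketch that parses a comma-separated http.proxy.exception.list value and checks a host against it. The class and method names are hypothetical and the matching logic in the actual patch may differ; the blank-entry filter is the spot the quoted typo fix guards:

```java
import java.util.ArrayList;
import java.util.List;

public class ProxyExceptions {
    // Parse a comma-separated exception list as it might appear in
    // http.proxy.exception.list, skipping blank entries (the check
    // the corrected typo !"".equals(...) performs).
    public static List<String> parse(String raw) {
        List<String> out = new ArrayList<>();
        for (String entry : raw.split(",")) {
            String trimmed = entry.trim();
            if (!"".equals(trimmed)) {
                out.add(trimmed);
            }
        }
        return out;
    }

    // A host bypasses the proxy if it matches an entry's suffix,
    // so "intranet.example" also covers its subdomains.
    public static boolean bypassProxy(String host, List<String> exceptions) {
        for (String e : exceptions) {
            if (host.endsWith(e)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> ex = parse("intranet.example, ,10.0.0.1");
        System.out.println(ex);
        System.out.println(bypassProxy("wiki.intranet.example", ex));
        System.out.println(bypassProxy("www.apache.org", ex));
    }
}
```

With the sample value above, the blank middle entry is dropped, an intranet subdomain bypasses the proxy, and an external host does not.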
[jira] Created: (NUTCH-330) command line tool to search a Lucene index
command line tool to search a Lucene index
------------------------------------------

                Key: NUTCH-330
                URL: http://issues.apache.org/jira/browse/NUTCH-330
            Project: Nutch
         Issue Type: Improvement
         Components: searcher
   Affects Versions: 0.8-dev
        Environment: ubuntu
           Reporter: Renaud Richardet
           Priority: Minor
        Attachments: clSearch.diff

A tool to search a Lucene index from the command line; it makes development and testing faster.

usage: bin/nutch searchindex [index dir] [searchkeyword]
example: bin/nutch searchindex crawl/index flowers
[jira] Updated: (NUTCH-330) command line tool to search a Lucene index
[ http://issues.apache.org/jira/browse/NUTCH-330?page=all ]

Renaud Richardet updated NUTCH-330:
-----------------------------------

    Attachment: clSearch.diff

unified diff against head

command line tool to search a Lucene index
------------------------------------------

                Key: NUTCH-330
                URL: http://issues.apache.org/jira/browse/NUTCH-330
            Project: Nutch
         Issue Type: Improvement
         Components: searcher
   Affects Versions: 0.8-dev
        Environment: ubuntu
           Reporter: Renaud Richardet
           Priority: Minor
        Attachments: clSearch.diff

A tool to search a Lucene index from the command line; it makes development and testing faster.

usage: bin/nutch searchindex [index dir] [searchkeyword]
example: bin/nutch searchindex crawl/index flowers
[jira] Updated: (NUTCH-330) command line tool to search a Lucene index
[ http://issues.apache.org/jira/browse/NUTCH-330?page=all ]

Renaud Richardet updated NUTCH-330:
-----------------------------------

    Attachment: clSearch.diff

Forgot the echo in the sh script...

command line tool to search a Lucene index
------------------------------------------

                Key: NUTCH-330
                URL: http://issues.apache.org/jira/browse/NUTCH-330
            Project: Nutch
         Issue Type: Improvement
         Components: searcher
   Affects Versions: 0.8-dev
        Environment: ubuntu
           Reporter: Renaud Richardet
           Priority: Minor
        Attachments: clSearch.diff, clSearch.diff

A tool to search a Lucene index from the command line; it makes development and testing faster.

usage: bin/nutch searchindex [index dir] [searchkeyword]
example: bin/nutch searchindex crawl/index flowers