[jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

2007-11-15 Thread Renaud Richardet (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542819
 ] 

Renaud Richardet commented on NUTCH-444:


hi,
i am travelling and will be offline until january 2008. thanks for
your patience.
Renaud


-- 
renaudatoslutionsdotcom
www.oslutions.com


 Possibly use a different library to parse RSS feed for improved performance 
 and compatibility
 -

 Key: NUTCH-444
 URL: https://issues.apache.org/jira/browse/NUTCH-444
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Renaud Richardet
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.0.0

 Attachments: feed.tar.bz2, NUTCH-444.1-1.patch, 
 NUTCH-444.Mattmann.061707.patch.txt, NUTCH-444.patch, parse-feed-v2.tar.bz2, 
 parse-feed.tar.bz2


 As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current 
 library (feedparser) has the following issues:
 - OutOfMemory when parsing  100k feeds, since it has to convert the feed to 
 jdom first
 - no support for Atom 1.0
 - there has been no development in the last year
 Alternatives are:
 - Rome 
 - Informa
 - custom implementation based on Stax
 - ??
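
 The "custom implementation based on Stax" alternative could look roughly like 
 the sketch below. It is not taken from any of the attached patches: it only 
 uses the JDK's javax.xml.stream API to pull the item titles out of an RSS 2.0 
 document without building a JDOM tree first. The names StaxFeedSketch and 
 itemTitles are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Nutch or feedparser.
class StaxFeedSketch {

  // Streams through an RSS 2.0 document and collects the <title> of each <item>.
  static List<String> itemTitles(String rssXml) throws XMLStreamException {
    XMLInputFactory factory = XMLInputFactory.newInstance();
    XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(rssXml));
    List<String> titles = new ArrayList<>();
    boolean inItem = false;
    try {
      while (reader.hasNext()) {
        int event = reader.next();
        if (event == XMLStreamConstants.START_ELEMENT) {
          String name = reader.getLocalName();
          if ("item".equals(name)) {
            inItem = true;
          } else if (inItem && "title".equals(name)) {
            // Only titles nested inside an <item> are kept; the channel title is skipped.
            titles.add(reader.getElementText());
          }
        } else if (event == XMLStreamConstants.END_ELEMENT
            && "item".equals(reader.getLocalName())) {
          inItem = false;
        }
      }
    } finally {
      reader.close();
    }
    return titles;
  }

  public static void main(String[] args) throws XMLStreamException {
    String rss = "<rss version=\"2.0\"><channel><title>Feed</title>"
        + "<item><title>First</title><link>http://example.org/1</link></item>"
        + "<item><title>Second</title><link>http://example.org/2</link></item>"
        + "</channel></rss>";
    System.out.println(itemTitles(rss));
  }
}
```

 Because the reader only moves forward and keeps no tree, memory use should 
 stay roughly flat as the feed grows, which is the point of the Stax option.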

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-540) some problem about the Nutch cache

2007-08-09 Thread Renaud Richardet (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Richardet updated NUTCH-540:
---

Priority: Major  (was: Blocker)

could you please attach log files and error messages? thanks

 some problem about the Nutch cache
 --

 Key: NUTCH-540
 URL: https://issues.apache.org/jira/browse/NUTCH-540
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 0.9.0
 Environment: Red hat AS4 + Tomcat5.5 + Nutch0.9
Reporter: crossany
 Fix For: 0.9.0


 I am Chinese.
 I am testing Chinese-language search in Nutch. I installed Nutch 0.9 in 
 Tomcat 5 on Linux; the Tomcat charset is UTF-8, and I used Nutch to crawl a 
 Chinese website whose charset is also UTF-8. When I search for a Chinese word 
 with Nutch on Tomcat, the title and description of the search results display 
 correctly. But when I click the cached link, the cached page is displayed with 
 the wrong character encoding, even though the cached page's charset is also 
 UTF-8. I found a website that uses Nutch, http://www.synoo.com:8080/zh/, and 
 searching for a Chinese word there shows the same error.
 I used Luke to look at the segments and it can display the Chinese words, so 
 I think this may be a bug.




[jira] Updated: (NUTCH-369) StringUtil.resolveEncodingAlias is unuseful.

2007-02-24 Thread Renaud Richardet (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Richardet updated NUTCH-369:
---

Attachment: patch.diff

unified diff against head.

- fixes encoding, as described by King Kong
- removes non-valid features
- fixes logging

 StringUtil.resolveEncodingAlias  is unuseful.
 -

 Key: NUTCH-369
 URL: https://issues.apache.org/jira/browse/NUTCH-369
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8
 Environment: all
Reporter: King Kong
 Attachments: patch.diff


 After we defined the encoding alias map in StringUtil, HTML parsing still uses 
 the original encoding.
 I found that nekohtml, which HtmlParser uses, reads the charset from the meta tag.
 We can set its feature 
 "http://cyberneko.org/html/features/scanner/ignore-specified-charset" to true 
 so that nekohtml uses the encoding we set.
 Concretely:
   private DocumentFragment parseNeko(InputSource input) throws Exception {
     DOMFragmentParser parser = new DOMFragmentParser();
     // some plugins, e.g., creativecommons, need to examine html comments
     try {
 +     parser.setFeature("http://cyberneko.org/html/features/scanner/ignore-specified-charset", true);
       parser.setFeature("http://apache.org/xml/features/include-comments", true);

 BTW, it must be added at the start of the try block, before the following 
 statement, because that statement 
 (parser.setFeature("http://apache.org/xml/features/include-comments", true);) 
 throws an exception.
  




[jira] Updated: (NUTCH-369) StringUtil.resolveEncodingAlias is unuseful.

2007-02-24 Thread Renaud Richardet (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Richardet updated NUTCH-369:
---

Attachment: remover.diff

just FYI, you can further filter which element neko should keep and remove. see 
the patch for an example and 
http://people.apache.org/~andyc/neko/doc/html/settings.html 

 StringUtil.resolveEncodingAlias  is unuseful.
 -

 Key: NUTCH-369
 URL: https://issues.apache.org/jira/browse/NUTCH-369
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.9.0
 Environment: all
Reporter: King Kong
Priority: Minor
 Attachments: patch.diff, remover.diff


 After we defined the encoding alias map in StringUtil, HTML parsing still uses 
 the original encoding.
 I found that nekohtml, which HtmlParser uses, reads the charset from the meta tag.
 We can set its feature 
 "http://cyberneko.org/html/features/scanner/ignore-specified-charset" to true 
 so that nekohtml uses the encoding we set.
 Concretely:
   private DocumentFragment parseNeko(InputSource input) throws Exception {
     DOMFragmentParser parser = new DOMFragmentParser();
     // some plugins, e.g., creativecommons, need to examine html comments
     try {
 +     parser.setFeature("http://cyberneko.org/html/features/scanner/ignore-specified-charset", true);
       parser.setFeature("http://apache.org/xml/features/include-comments", true);

 BTW, it must be added at the start of the try block, before the following 
 statement, because that statement 
 (parser.setFeature("http://apache.org/xml/features/include-comments", true);) 
 throws an exception.
  




[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-13 Thread Renaud Richardet (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472733
 ] 

Renaud Richardet commented on NUTCH-443:


hi All,

Glad to see that this patch is moving forward :-)
I have been carried away by a project, but will have some time this week. 
Please let me know if there is anything I can help on, especially on the 
parsers.

Cheers,
Renaud

 allow parsers to return multiple Parse object, this will speed up the rss 
 parser
 

 Key: NUTCH-443
 URL: https://issues.apache.org/jira/browse/NUTCH-443
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Renaud Richardet
 Assigned To: Chris A. Mattmann
Priority: Minor
 Fix For: 0.9.0

 Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, 
 NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, 
 NUTCH-443-draft-v6.patch, parse-map-core-draft-v1.patch, 
 parse-map-core-untested.patch, parsers.diff


 Allow Parser#parse to return a Map<String, Parse>. This way, the RSS parser 
 can return multiple Parse objects, which will all be indexed separately. 
 Advantage: no need to fetch all feed items separately.
 See the discussion at 
 http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html




[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-09 Thread Renaud Richardet (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Richardet updated NUTCH-443:
---

Attachment: NUTCH-443-draft-v4.patch

Hi Dogacan,

Thanks for merging the patches, good teamwork!

I worked on the RSS parser, it should now basically work.
Now, all core and plugin-tests pass, except for TestRSSparser, will work on 
that. Once this is in place, I will have a look at the other issues with fetch 
time, etc.

I merged my changes with your patch, version 3.


 allow parsers to return multiple Parse object, this will speed up the rss 
 parser
 

 Key: NUTCH-443
 URL: https://issues.apache.org/jira/browse/NUTCH-443
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Renaud Richardet
Priority: Minor
 Fix For: 0.9.0

 Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, 
 NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, 
 parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff


 Allow Parser#parse to return a Map<String, Parse>. This way, the RSS parser 
 can return multiple Parse objects, which will all be indexed separately. 
 Advantage: no need to fetch all feed items separately.
 See the discussion at 
 http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html




[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-09 Thread Renaud Richardet (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471878
 ] 

Renaud Richardet commented on NUTCH-443:


Nutch Newbie, Gal, Chris

It's great that you discuss alternative RSS parsing libraries, but the 
resolution of this bug does not depend on which underlying RSS library is used 
in RSSParser. Would you mind moving the conversation to the new issue I created 
for it (NUTCH-444)? Thanks a bunch.



 allow parsers to return multiple Parse object, this will speed up the rss 
 parser
 

 Key: NUTCH-443
 URL: https://issues.apache.org/jira/browse/NUTCH-443
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Renaud Richardet
Priority: Minor
 Fix For: 0.9.0

 Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, 
 NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, 
 parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff


 Allow Parser#parse to return a Map<String, Parse>. This way, the RSS parser 
 can return multiple Parse objects, which will all be indexed separately. 
 Advantage: no need to fetch all feed items separately.
 See the discussion at 
 http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html




[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-08 Thread Renaud Richardet (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Richardet updated NUTCH-443:
---

Attachment: parsers.diff

Great, here's my work in progress (not finished, not tested) for updating all 
parsers and extending the RSSparser.

 allow parsers to return multiple Parse object, this will speed up the rss 
 parser
 

 Key: NUTCH-443
 URL: https://issues.apache.org/jira/browse/NUTCH-443
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Renaud Richardet
Priority: Minor
 Fix For: 0.9.0

 Attachments: parse-map-core-draft-v1.patch, 
 parse-map-core-untested.patch, parsers.diff


 Allow Parser#parse to return a Map<String, Parse>. This way, the RSS parser 
 can return multiple Parse objects, which will all be indexed separately. 
 Advantage: no need to fetch all feed items separately.
 See the discussion at 
 http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html




Re: FW: RSS-fecter and index individul-how can i realize this function

2007-02-08 Thread Renaud Richardet

HUYLEBROECK Jeremy RD-ILAB-SSF wrote:

I send again this message as it apparently didn't go through.
(I am messing up with my email addresses on the mailing list...) 


-Original Message-
Sent: Friday, February 02, 2007 10:29 AM

Using Nutch 0.8, we modified the code from the fetching/parsing steps onward.
We have a different implementation of the Parse object and OutputFormat, 
including an additional list of ParseData objects saved in an additional 
subfolder in the DFS.
We changed the indexing step a lot too, so we don't use the Nutch code there.
  
Is your implementation similar to what we started at 
https://issues.apache.org/jira/browse/NUTCH-443? If you think some of 
your changes could be integrated, please post a patch there.


Thanks for sharing,
Renaud


-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Friday, February 02, 2007 10:19 AM
To: nutch-dev@lucene.apache.org
Subject: Re: RSS-fecter and index individul-how can i realize this function


Gal Nitzan wrote:
  

IMHO the data that is needed i.e. the data that will be fetched in the next fetch process 
is already available in the item element. Each item element represents one 
web resource. And there is no reason to go to the server and re-fetch that resource.



Perhaps ProtocolOutput should change.  The method:

   Content getContent();

could be deprecated and replaced with:

   Content[] getContents();

This would require changes to the indexing pipeline.  I can't think of
any severe complications, but I haven't looked closely.

Could something like that work?

Doug
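
Doug's suggestion above could look roughly like the sketch below. It is only 
an illustration under stated assumptions: Content is a stand-in class, not 
Nutch's own Content, ProtocolOutputSketch is a made-up name, and bridging the 
old method through a default method is one possible way to keep existing 
callers working, not something proposed in the thread.

```java
// Stand-in for Nutch's Content class; only a URL is modeled here.
class Content {
  final String url;

  Content(String url) {
    this.url = url;
  }
}

// Hypothetical shape of the changed interface.
interface ProtocolOutputSketch {

  // New method: one fetch may yield several Content objects,
  // for example one per feed item.
  Content[] getContents();

  // Old method kept for existing callers: returns the first content,
  // or null if there is none.
  @Deprecated
  default Content getContent() {
    Content[] all = getContents();
    return all.length == 0 ? null : all[0];
  }
}

class ProtocolOutputDemo {
  public static void main(String[] args) {
    ProtocolOutputSketch out = () -> new Content[] {
        new Content("http://example.org/feed#1"),
        new Content("http://example.org/feed#2")
    };
    System.out.println(out.getContents().length);
    System.out.println(out.getContent().url);
  }
}
```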


  



--
Renaud Richardet  +1 617 230 9112
my email is my first name at apache.org  http://www.oslutions.com



[jira] Created: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-07 Thread Renaud Richardet (JIRA)
allow parsers to return multiple Parse object, this will speed up the rss parser


 Key: NUTCH-443
 URL: https://issues.apache.org/jira/browse/NUTCH-443
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Renaud Richardet
Priority: Minor
 Fix For: 0.9.0


Allow Parser#parse to return a Map<String, Parse>. This way, the RSS parser can 
return multiple Parse objects, which will all be indexed separately. Advantage: 
no need to fetch all feed items separately.
See the discussion at 
http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html





Re: RSS-fecter and index individul-how can i realize this function

2007-02-07 Thread Renaud Richardet

Doug Cutting wrote:

Renaud Richardet wrote:
I see. I was thinking that I could index the feed items without 
having to fetch them individually.


Okay, so if Parser#parse returned a Map<String, Parse>, then the URL 
for each parse should be that of its link, since you don't want to 
fetch that separately. Right?

Exactly.


So now the question is, how much impact would this change to the 
Parser API have on the rest of Nutch? It would require changes to all 
Parser implementations, to ParseSegment, to ParseUtil, and to 
Fetcher. But, as far as I can tell, most of these changes look 
straightforward.
I think so, too. I have opened an issue in JIRA 
(https://issues.apache.org/jira/browse/NUTCH-443) and will give it a try.

Doğacan, have you started working on it yet?

Thanks,
Renaud
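
The API change discussed above can be sketched as follows. This is a minimal 
illustration, not the NUTCH-443 patch: ParseStub and FeedItem are stand-ins 
invented here (Nutch's real Parse interface carries much more), and the only 
point shown is that each feed item gets its own parse, keyed by the URL of its 
link.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in for Nutch's Parse interface; only the extracted text is modeled.
class ParseStub {
  final String text;

  ParseStub(String text) {
    this.text = text;
  }
}

// Hypothetical feed item: the link it points to and its description.
class FeedItem {
  final String link;
  final String description;

  FeedItem(String link, String description) {
    this.link = link;
    this.description = description;
  }
}

class MultiParseSketch {

  // Returns one parse per feed item, keyed by the item's link, so that each
  // item can be indexed under its own URL without being fetched separately.
  static Map<String, ParseStub> parse(FeedItem[] items) {
    Map<String, ParseStub> parses = new LinkedHashMap<>();
    for (FeedItem item : items) {
      parses.put(item.link, new ParseStub(item.description));
    }
    return parses;
  }

  public static void main(String[] args) {
    FeedItem[] items = {
        new FeedItem("http://example.org/post/1", "First post"),
        new FeedItem("http://example.org/post/2", "Second post")
    };
    for (Map.Entry<String, ParseStub> e : parse(items).entrySet()) {
      System.out.println(e.getKey() + " -> " + e.getValue().text);
    }
  }
}
```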



Re: RSS-fecter and index individul-how can i realize this function

2007-02-06 Thread Renaud Richardet

Hi Chris, Doug,

Chris Mattmann wrote:

Hi Doug,

  

Since the target of the link must still be indexed separately from the
item itself, how much use is all this?  If the RSS document is
considered a single page that changes frequently, and item's links are
considered ordinary outlinks, isn't much the same effect achieved?



IMHO, yes. That's why it's been hard for me to understand the real use case
for what Gal et al. are talking about. I've been trying to wrap my head
around it, but it seems to me the capability they require is sort of already
provided...
  
Not sure I understand: An RSS-feed is a collection of feed-entries, and 
each feed-entry would be indexed as a separate document (each feed-entry 
has a url or uuid as unique identifier).
What happens with the RSS-feed itself? Is it indexed, or considered as a 
container that just needs to be fetched and fetched again for new entries?


The usecase is that you index RSS-feeds, but your users can search each 
feed-entry as a single document. Does it make sense?


Thanks,
Renaud


Cheers,
  Chris

  

Doug



__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop: 171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.



  



--
Renaud Richardet  +1 617 230 9112
my email is my first name at apache.org  http://www.oslutions.com



Re: api.RegexURLFilterBase - Configuration Resources

2007-02-06 Thread Renaud Richardet

Tobias Zahn wrote:

Hello!
I have written a new plugin extending the IndexingFilter and using the
RegexURLFilterBase class.
In the log there is this message:

FATAL api.RegexURLFilterBase - Can't find resource: null
  
in your new class CustomIndexingFilter, create a field Configuration 
conf, and implement setConf, getConf like this:


public void setConf(Configuration conf) {
  this.conf = conf;
}

public Configuration getConf() {
  return this.conf;
}

and pass the conf object to RegexURLFilterBase before calling it.

RegexURLFilterBase r = new RegexURLFilter();
r.setConf(conf);
r.filter(sometext);

This should do the trick.

I assume you have setup the build configuration of your plugin 
correctly, was tricky for me ;-)


build.xml

<!-- Build compilation dependencies -->
<target name="deps-jar">
..
  <ant target="jar" inheritall="false" dir="../urlfilter-regex"/>
  <ant target="jar" inheritall="false" dir="../lib-regex-filter"/>
</target>

<!-- Add compilation dependencies to classpath -->
<path id="plugin.deps">
  <fileset dir="${nutch.root}/build">
...
    <include name="**/urlfilter-regex/*.jar" />
    <include name="**/lib-regex-filter/*.jar" />
  </fileset>
</path>

and plugin.xml
<requires>
  <import plugin="nutch-extensionpoints"/>
..
  <import plugin="urlfilter-regex"/>
  <import plugin="lib-regex-filter"/>
</requires>

HTH,
Renaud




I don't know how to handle that Configuration-Objects (setConf() etc.)
What should I do to avoid that error? Where does the
Configuration-Object come from?

TIA
Tobias Zahn

  



--
Renaud Richardet  +1 617 230 9112
my email is my first name at apache.org  http://www.oslutions.com



Re: RSS-fecter and index individul-how can i realize this function

2007-02-06 Thread Renaud Richardet

Doug Cutting wrote:

Renaud Richardet wrote:
The usecase is that you index RSS-feeds, but your users can search 
each feed-entry as a single document. Does it make sense?


But each feed item also contains a link whose content will be indexed 
and that's generally a superset of the item.  

Agreed
So should there be two urls indexed per item?  

I don't think so
In many cases, the best thing to do is to index only the linked page, 
not the feed item at all.  In some (rare?) cases, there might be items 
without a link, whose only content is directly in the feed, or where 
the content in the feed is complementary to that in the linked page.  
In these cases it might be useful to combine the two (the feed item 
and the linked content), indexing both.  The proposed change might 
permit that.  Is that the case you're concerned about?
I see. I was thinking that I could index the feed items without having 
to fetch them individually.


More fundamentally, I want to index only the blog-entry text, and not 
the elements around it (header, menus, ads, ...), so as to improve the 
search results.


Here's my case, the proposed changes would allow me to do (*)

1) parse feeds:

for each (feedentry : feed) do
|
|  if (full-text entries) then
|  |  index each feed entry as a single document; blog header, menus are not indexed. *
|  else
|  |  create a special outlink for each feed entry, which includes metadata (content, time, etc)
|  endif
|
done

2) on a next fetch loop:

for each (link) do
|
|  if (this is a normal link) then
|  |  fetch it and index it normally
|  else if (this link comes from an already indexed feed entry) then
|  |  end, do not fetch it *
|  else if (this is a special outlink) then
|  |  guess which DOM nodes hold the post content
|  |  index it; blog header, menus are not indexed.
|  endif
|
done


Thanks,
Renaud
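
The second loop of the pseudocode above can be written down as a small Java 
sketch. Everything here is hypothetical: LinkKind and FetchLoopSketch are 
names invented for illustration, and how a link would actually be classified 
(for example through metadata stored with the outlink) is left open, as it is 
in the message.

```java
// Hypothetical link categories taken from the pseudocode.
enum LinkKind { NORMAL, FROM_INDEXED_FEED_ENTRY, SPECIAL_OUTLINK }

class FetchLoopSketch {

  // Describes what the fetch loop would do with a link of the given kind.
  static String action(LinkKind kind) {
    switch (kind) {
      case NORMAL:
        return "fetch and index normally";
      case FROM_INDEXED_FEED_ENTRY:
        return "skip, already indexed from the feed";
      case SPECIAL_OUTLINK:
        return "fetch, extract the post content, index it";
      default:
        throw new IllegalArgumentException("unknown kind: " + kind);
    }
  }

  public static void main(String[] args) {
    for (LinkKind kind : LinkKind.values()) {
      System.out.println(kind + ": " + action(kind));
    }
  }
}
```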


Re: RSS-fecter and index individul-how can i realize this function

2007-02-02 Thread Renaud Richardet
 not reflect
those of either NASA, JPL, or the California Institute of Technology.





  



--
renaud richardet   +1 617 230 9112
renaud at oslutions.com http://www.oslutions.com



[jira] Updated: (NUTCH-412) plugin to parse the feed-url (rss/atom) of a blog

2006-12-03 Thread Renaud Richardet (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-412?page=all ]

Renaud Richardet updated NUTCH-412:
---

Attachment: plugin_parse-feedUrl2.diff

 plugin to parse the feed-url (rss/atom) of a blog
 -

 Key: NUTCH-412
 URL: http://issues.apache.org/jira/browse/NUTCH-412
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 0.9.0
Reporter: Renaud Richardet
Priority: Minor
 Attachments: plugin_parse-feedUrl.diff, plugin_parse-feedUrl2.diff


 A plugin that extracts the feed URL (RSS/Atom) of a blog by retrieving the 
 href from the <head><link> element (if found), and stores it in metadata. 
 The metadata can be accessed with 
 parse.getData().getMeta("feedUrl");
 You can test this plugin with the main method of HtmlParser.
 Thanks for any feedback.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Created: (NUTCH-412) plugin to parse the feed-url (rss/atom) of a blog

2006-12-02 Thread Renaud Richardet (JIRA)
plugin to parse the feed-url (rss/atom) of a blog
-

 Key: NUTCH-412
 URL: http://issues.apache.org/jira/browse/NUTCH-412
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 0.9.0
Reporter: Renaud Richardet
Priority: Minor


A plugin that extracts the feed URL (RSS/Atom) of a blog by retrieving the href 
from the <head><link> element (if found), and stores it in metadata. 

The metadata can be accessed with 
parse.getData().getMeta("feedUrl");
You can test this plugin with the main method of HtmlParser.

Thanks for any feedback.






[jira] Updated: (NUTCH-412) plugin to parse the feed-url (rss/atom) of a blog

2006-12-02 Thread Renaud Richardet (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-412?page=all ]

Renaud Richardet updated NUTCH-412:
---

Attachment: plugin_parse-feedUrl.diff

unified diff against head (Rev: 481445)

 plugin to parse the feed-url (rss/atom) of a blog
 -

 Key: NUTCH-412
 URL: http://issues.apache.org/jira/browse/NUTCH-412
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 0.9.0
Reporter: Renaud Richardet
Priority: Minor
 Attachments: plugin_parse-feedUrl.diff


 A plugin that extracts the feed URL (RSS/Atom) of a blog by retrieving the 
 href from the <head><link> element (if found), and stores it in metadata. 
 The metadata can be accessed with 
 parse.getData().getMeta("feedUrl");
 You can test this plugin with the main method of HtmlParser.
 Thanks for any feedback.





[jira] Created: (NUTCH-359) extraction of links will fail for whole page if one single link cannot be parsed

2006-08-24 Thread Renaud Richardet (JIRA)
extraction of links will fail for whole page if one single link cannot be parsed


 Key: NUTCH-359
 URL: http://issues.apache.org/jira/browse/NUTCH-359
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8
 Environment: Ubuntu Dapper
Reporter: Renaud Richardet
Priority: Minor
 Attachments: outlink.diff

When Nutch parses the outlinks of a fetched page, the process will fail if a 
single link cannot be parsed (e.g. java.net.MalformedURLException: unknown 
protocol). The attached patch will keep indexing the remaining links on that 
page even if one fails.
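
The behaviour described above can be sketched like this. It is not the 
attached outlink.diff, which is not reproduced in this message, and 
OutlinkSketch and parseLinks are hypothetical names: the sketch only shows the 
idea of catching the exception per link instead of letting one bad link abort 
the whole page.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Nutch code.
class OutlinkSketch {

  // Converts each href to a URL; hrefs that cannot be parsed are skipped
  // so that the remaining links on the page are still returned.
  static List<URL> parseLinks(List<String> hrefs) {
    List<URL> links = new ArrayList<>();
    for (String href : hrefs) {
      try {
        links.add(new URI(href).toURL());
      } catch (URISyntaxException | MalformedURLException | IllegalArgumentException e) {
        // One bad link (e.g. "unknown protocol") must not fail the whole page.
        System.err.println("skipping link " + href + ": " + e.getMessage());
      }
    }
    return links;
  }

  public static void main(String[] args) {
    List<String> hrefs = new ArrayList<>();
    hrefs.add("http://example.org/a");
    hrefs.add("skype:someone");
    hrefs.add("http://example.org/b");
    System.out.println(parseLinks(hrefs));
  }
}
```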





Re: [Fwd: Re: [Nutch Wiki] Update of RenaudRichardet by RenaudRichardet]

2006-08-24 Thread Renaud Richardet

Thank you Stefan,

I think you meant editing 
http://wiki.apache.org/nutch/RunNutchInEclipse, not 
http://wiki.apache.org/nutch/RenaudRichardet, right?


Yes, it's definitely OK with me, that's why I put it in the wiki. 
Please keep adding your improvements!


Thanks again,
Renaud



 Original Message 
Subject: Re: [Nutch Wiki] Update of RenaudRichardet by 
RenaudRichardet

Date: Wed, 23 Aug 2006 12:00:22 -0700
From: Stefan Groschupf [EMAIL PROTECTED]
Reply-To: nutch-dev@lucene.apache.org
To: nutch-dev@lucene.apache.org
CC: [EMAIL PROTECTED]
References: [EMAIL PROTECTED]



Hi Renaud,
I updated your page with some more details, I hope that is ok for you.
Thanks for creating it.
Stefan


On 23.08.2006 at 11:51, Apache Wiki wrote:


Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki  
for change notification.


The following page has been changed by RenaudRichardet:
http://wiki.apache.org/nutch/RenaudRichardet

New page:
{{{
Renaud Richardet
COO America
Wyona Inc.  -   Open Source Content Management   -   Apache Lenya
office +1 857 776-3195 mobile +1 617 230 9112
renaud.richardet at wyona.com  http://www.wyona.com
}}}







--
Renaud Richardet
COO America
Wyona-   Open Source Content Management   -   Apache Lenya
office +1 857 776-3195  mobile +1 617 230 9112
renaud.richardet at wyona.com   http://www.wyona.com



[jira] Updated: (NUTCH-346) Improve readability of logs/hadoop.log

2006-08-21 Thread Renaud Richardet (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-346?page=all ]

Renaud Richardet updated NUTCH-346:
---

Attachment: log4j_plugins.diff

OK, here we go. This patch should be good for 0.8 and trunk.

 Improve readability of logs/hadoop.log
 --

 Key: NUTCH-346
 URL: http://issues.apache.org/jira/browse/NUTCH-346
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
 Environment: ubuntu dapper
Reporter: Renaud Richardet
Priority: Minor
 Attachments: log4j_plugins.diff


 adding
 log4j.logger.org.apache.nutch.plugin.PluginRepository=WARN
 to conf/log4j.properties
 dramatically improves the readability of the logs in logs/hadoop.log (removes 
 all INFO)





[jira] Created: (NUTCH-346) Improve readability of logs/hadoop.log

2006-08-09 Thread Renaud Richardet (JIRA)
Improve readability of logs/hadoop.log
--

 Key: NUTCH-346
 URL: http://issues.apache.org/jira/browse/NUTCH-346
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
 Environment: ubuntu dapper
Reporter: Renaud Richardet
Priority: Minor


adding
log4j.logger.org.apache.nutch.plugin.PluginRepository=WARN
to conf/log4j.properties
dramatically improves the readability of the logs in logs/hadoop.log (removes 
all INFO)





[jira] Commented: (NUTCH-266) hadoop bug when doing updatedb

2006-08-08 Thread Renaud Richardet (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-266?page=comments#action_12426579 ] 

Renaud Richardet commented on NUTCH-266:


KuroSaka, yes you can download the hadoop jar, release 0.5.0 from the project 
website: http://lucene.apache.org/hadoop/ and 
http://www.apache.org/dyn/closer.cgi/lucene/hadoop/

 hadoop bug when doing updatedb
 --

 Key: NUTCH-266
 URL: http://issues.apache.org/jira/browse/NUTCH-266
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8
 Environment: windows xp, JDK 1.4.2_04
Reporter: Eugen Kochuev
 Fix For: 0.9.0, 0.8.1

 Attachments: patch.diff, patch_hadoop-0.5.0.diff


 I constantly get the following error message
 060508 230637 Running job: job_pbhn3t
 060508 230637 
 c:/nutch/crawl-20060508230625/crawldb/current/part-0/data:0+245
 060508 230637 
 c:/nutch/crawl-20060508230625/segments/20060508230628/crawl_fetch/part-0/data:0+296
 060508 230637 
 c:/nutch/crawl-20060508230625/segments/20060508230628/crawl_parse/part-0:0+5258
 060508 230637 job_pbhn3t
 java.io.IOException: Target 
 /tmp/hadoop/mapred/local/reduce_qnd5sx/map_qjp7tf.out already exists
 at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:162)
 at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:62)
 at 
 org.apache.hadoop.fs.LocalFileSystem.renameRaw(LocalFileSystem.java:191)
 at org.apache.hadoop.fs.FileSystem.rename(FileSystem.java:306)
 at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:101)
 Exception in thread "main" java.io.IOException: Job failed!
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
 at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:54)
 at org.apache.nutch.crawl.Crawl.main(Crawl.java:114)





[jira] Commented: (NUTCH-330) command line tool to search a Lucene index

2006-08-08 Thread Renaud Richardet (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-330?page=comments#action_12426629 ]

Renaud Richardet commented on NUTCH-330:


This issue is obsolete: I just found out that Nutch already allows searching 
from the command line via
bin/nutch org.apache.nutch.searcher.NutchBean [searchterm]. It assumes that you 
call it from the base of your crawl directory.


 command line tool to search a Lucene index
 --

 Key: NUTCH-330
 URL: http://issues.apache.org/jira/browse/NUTCH-330
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Affects Versions: 0.8
 Environment: ubuntu
Reporter: Renaud Richardet
Priority: Minor
 Attachments: clSearch.diff, clSearch.diff


 A tool to search a Lucene index from the command line; it makes 
 development and testing faster.
 usage:   bin/nutch searchindex [index dir] [searchkeyword]
 example: bin/nutch searchindex crawl/index flowers





[jira] Updated: (NUTCH-266) hadoop bug when doing updatedb

2006-08-07 Thread Renaud Richardet (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-266?page=all ]

Renaud Richardet updated NUTCH-266:
---

Attachment: patch_hadoop-0.5.0.diff

Now that Hadoop 0.5 has been released, here's the patch to use hadoop-0.5.0.jar 
in Nutch 0.8.x.
HTH,
Renaud





[jira] Updated: (NUTCH-266) hadoop bug when doing updatedb

2006-08-02 Thread Renaud Richardet (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-266?page=all ]

Renaud Richardet updated NUTCH-266:
---

Attachment: patch.diff

Thank you, Sami.

We had a similar problem on Windows XP and were able to fix it by using 
hadoop-nightly.jar. However, because of some changes in Hadoop 
(http://issues.apache.org/jira/browse/HADOOP-252), Nutch would no longer 
compile. The attached patch fixes this. Let us know if there is a better 
way.


 hadoop bug when doing updatedb
 --

 Key: NUTCH-266
 URL: http://issues.apache.org/jira/browse/NUTCH-266
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8
 Environment: windows xp, JDK 1.4.2_04
Reporter: Eugen Kochuev
 Attachments: patch.diff


 I constantly get the following error message:
 060508 230637 Running job: job_pbhn3t
 060508 230637 c:/nutch/crawl-20060508230625/crawldb/current/part-0/data:0+245
 060508 230637 c:/nutch/crawl-20060508230625/segments/20060508230628/crawl_fetch/part-0/data:0+296
 060508 230637 c:/nutch/crawl-20060508230625/segments/20060508230628/crawl_parse/part-0:0+5258
 060508 230637 job_pbhn3t
 java.io.IOException: Target /tmp/hadoop/mapred/local/reduce_qnd5sx/map_qjp7tf.out already exists
     at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:162)
     at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:62)
     at org.apache.hadoop.fs.LocalFileSystem.renameRaw(LocalFileSystem.java:191)
     at org.apache.hadoop.fs.FileSystem.rename(FileSystem.java:306)
     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:101)
 Exception in thread "main" java.io.IOException: Job failed!
     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
     at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:54)
     at org.apache.nutch.crawl.Crawl.main(Crawl.java:114)





[jira] Updated: (NUTCH-208) http: proxy exception list:

2006-07-31 Thread Renaud Richardet (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-208?page=all ]

Renaud Richardet updated NUTCH-208:
---

Attachment: proxy_exception_list-0.8.diff

I updated the patch to 0.8 and corrected a small typo 
(if (!"".equals(input[i].trim())) {). The proxy exception list feature works 
well. You can test it using any proxy, e.g. tinyproxy 
(http://wiki.apache.org/nutch/SetupProxyForNutch).

 http: proxy exception list:
 ---

 Key: NUTCH-208
 URL: http://issues.apache.org/jira/browse/NUTCH-208
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.8
Reporter: Matthias Günter
Priority: Minor
 Attachments: patch.txt, patch.txt, proxy_exception_list-0.8.diff


 I suggest adding a parameter to nutch-default.xml that allows defining a 
 proxy exception list:
 <property>
   <name>http.proxy.exception.list</name>
   <value></value>
   <description>URLs and hosts that don't use the proxy (e.g. intranets)</description>
 </property>
 This is useful when scanning intranet/internet combinations from behind a 
 firewall. A preliminary patch showing the changes is attached to this 
 request. We will test it and update it if necessary. This also reflects the 
 reality in web browsers, which in most cases have an exception list.
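
To make the idea concrete, here is a minimal sketch of how a value of http.proxy.exception.list could be read and applied. It is not the code from the attached patch. The comma-separated value format, the class name and the method names are assumptions made for illustration; only the property name and the empty-entry check come from this thread.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; not taken from the attached patch.
class ProxyExceptionSketch {

    /** Splits an (assumed) comma-separated property value into lower-cased host names. */
    static Set<String> parseExceptions(String propertyValue) {
        Set<String> hosts = new HashSet<>();
        if (propertyValue == null) {
            return hosts;
        }
        String[] input = propertyValue.split(",");
        for (int i = 0; i < input.length; i++) {
            // Skip blank entries, e.g. the middle one in "a, ,b".
            if (!"".equals(input[i].trim())) {
                hosts.add(input[i].trim().toLowerCase(Locale.ROOT));
            }
        }
        return hosts;
    }

    /** Returns true when the host is not on the exception list and should go through the proxy. */
    static boolean useProxy(String host, Set<String> exceptions) {
        return !exceptions.contains(host.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Set<String> exceptions = parseExceptions("intranet.example.com, ,wiki.local");
        System.out.println(useProxy("intranet.example.com", exceptions));
        System.out.println(useProxy("www.apache.org", exceptions));
    }
}
```

An exact, case-insensitive host match keeps the sketch short; a real implementation might also want suffix or pattern matching, as many browsers offer.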





[jira] Created: (NUTCH-330) command line tool to search a Lucene index

2006-07-25 Thread Renaud Richardet (JIRA)
command line tool to search a Lucene index
--

 Key: NUTCH-330
 URL: http://issues.apache.org/jira/browse/NUTCH-330
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Affects Versions: 0.8-dev
 Environment: ubuntu
Reporter: Renaud Richardet
Priority: Minor
 Attachments: clSearch.diff

A tool to search a Lucene index from the command line; it makes development 
and testing faster.
usage:   bin/nutch searchindex [index dir] [searchkeyword]
example: bin/nutch searchindex crawl/index flowers





[jira] Updated: (NUTCH-330) command line tool to search a Lucene index

2006-07-25 Thread Renaud Richardet (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-330?page=all ]

Renaud Richardet updated NUTCH-330:
---

Attachment: clSearch.diff

unified diff against head

 command line tool to search a Lucene index
 --

 Key: NUTCH-330
 URL: http://issues.apache.org/jira/browse/NUTCH-330
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Affects Versions: 0.8-dev
 Environment: ubuntu
Reporter: Renaud Richardet
Priority: Minor
 Attachments: clSearch.diff


 A tool to search a Lucene index from the command line; it makes 
 development and testing faster.
 usage:   bin/nutch searchindex [index dir] [searchkeyword]
 example: bin/nutch searchindex crawl/index flowers





[jira] Updated: (NUTCH-330) command line tool to search a Lucene index

2006-07-25 Thread Renaud Richardet (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-330?page=all ]

Renaud Richardet updated NUTCH-330:
---

Attachment: clSearch.diff

forgot the echo in sh...

 command line tool to search a Lucene index
 --

 Key: NUTCH-330
 URL: http://issues.apache.org/jira/browse/NUTCH-330
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Affects Versions: 0.8-dev
 Environment: ubuntu
Reporter: Renaud Richardet
Priority: Minor
 Attachments: clSearch.diff, clSearch.diff


 A tool to search a Lucene index from the command line; it makes 
 development and testing faster.
 usage:   bin/nutch searchindex [index dir] [searchkeyword]
 example: bin/nutch searchindex crawl/index flowers
