Re: Infinite loop bug in Nutch 0.9

2009-04-02 Thread Doğacan Güney
On Wed, Apr 1, 2009 at 13:29, George Herlin ghher...@gmail.com wrote: Sorry, forgot to say, there is an added precondition to causing the bug: the redirection has to be fetched before the page it redirects to... if not, there will be a pre-existing crawl datum with a reasonable

Re: Infinite loop bug in Nutch 0.9

2009-04-02 Thread George Herlin
Indeed I have... that's how I found out. My test case: crawl http://www.purdue.ca/research/research_clinical.asp with crawl-urlfilter and regex-urlfilter ending with

    #purdue
    +^http://www.purdue.ca/research/
    +^http://www.purdue.ca/pdf/
    # reject anything else
    -.

The site is very small (which
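The effect of the filter rules quoted in this message can be sketched as follows. This is a minimal illustration, not Nutch's own filter code: the class `UrlFilterSketch` and its `Rule` helper are hypothetical names, and the sketch assumes the usual reading of such rule files, where rules are tried in file order, the first pattern found in the URL decides, `+` accepts and `-` rejects.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not a class from Nutch.
final class UrlFilterSketch {

    // One rule: a sign ('+' accept, '-' reject) followed by a regular expression.
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(String line) {
            this.accept = line.charAt(0) == '+';
            this.pattern = Pattern.compile(line.substring(1));
        }
    }

    // The rules quoted in the message, in file order.
    static final List<Rule> RULES = List.of(
        new Rule("+^http://www.purdue.ca/research/"),
        new Rule("+^http://www.purdue.ca/pdf/"),
        new Rule("-."));

    // Assumption: the first rule whose pattern is found in the URL decides;
    // a URL that matches no rule is rejected.
    static boolean accept(String url) {
        for (Rule rule : RULES) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The start URL from the test case falls under the first rule.
        System.out.println(accept("http://www.purdue.ca/research/research_clinical.asp"));
        // Any other URL on the site only reaches the final "-." rule.
        System.out.println(accept("http://www.purdue.ca/about/"));
    }
}
```

Under these assumptions only URLs below /research/ and /pdf/ stay in the crawl, which is consistent with the remark that the crawled site is very small.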

[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

2009-04-02 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694942#action_12694942 ] Julien Nioche commented on NUTCH-692: As I pointed out in my previous message the root

[jira] Commented: (NUTCH-721) Fetcher2 Slow

2009-04-02 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694986#action_12694986 ] Doğacan Güney commented on NUTCH-721: I've committed the Nutch 0.9 fetcher as OldFetcher.

[jira] Issue Comment Edited: (NUTCH-721) Fetcher2 Slow

2009-04-02 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694986#action_12694986 ] Doğacan Güney edited comment on NUTCH-721 at 4/2/09 6:01 AM:

Nutch Topical / Focused Crawl

2009-04-02 Thread MyD
Hi @ all, I'd like to turn Nutch into a focused / topical crawler. It's a part of my final year thesis. Further, I'd like others to be able to benefit from my work. I started to analyze the code and think that I found the right piece of code. I just wanted to know if I am on the right

Re: Nutch Topical / Focused Crawl

2009-04-02 Thread Ken Krugler
Hi @ all, I'd like to turn Nutch into a focused / topical crawler. It's a part of my final year thesis. Further, I'd like others to be able to benefit from my work. I started to analyze the code and think that I found the right piece of code. I just wanted to know if I am on the right track.

Re: Infinite loop bug in Nutch 0.9

2009-04-02 Thread Julien Nioche
George, try using Nutch-1.0 instead. I have tested your example with the SVN version and it did not run into the problem you described. J. 2009/4/2 George Herlin ghher...@gmail.com Indeed I have... that's how I found out. My test case: crawl

Using keywords metatags

2009-04-02 Thread Rodrigo Reyes C.
Hi all. I would like to add keywords to the information that gets inserted into the Lucene Indexes. I am thinking I need to insert them into the WebDB and later on insert them into the Lucene indexes. Am I right? Which extension points do I need to use? Thanks in advance -- Rodrigo Reyes

[jira] Updated: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

2009-04-02 Thread Cosmin Lehene (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cosmin Lehene updated NUTCH-692: Attachment: NUTCH-692.patch. This just checks whether the destination file exists before attempting to

[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

2009-04-02 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695122#action_12695122 ] Doğacan Güney commented on NUTCH-692: Thanks for the patch. Patch looks good to me.

[jira] Commented: (NUTCH-721) Fetcher2 Slow

2009-04-02 Thread Roger Dunk (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695170#action_12695170 ] Roger Dunk commented on NUTCH-721: For the following tests I've used the same segment

[jira] Commented: (NUTCH-721) Fetcher2 Slow

2009-04-02 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695233#action_12695233 ] Hudson commented on NUTCH-721: Integrated in Nutch-trunk #772 (See