On Wed, Apr 1, 2009 at 13:29, George Herlin ghher...@gmail.com wrote:
Sorry, forgot to say, there is an added precondition to causing the bug:
The redirection has to be fetched before the page it redirects to... if
not, there will be a pre-existing crawl datum with a reasonable
Indeed I have... that's how I found out.
My test case: crawl
http://www.purdue.ca/research/research_clinical.asp
with crawl-urlfilter and regex-urlfilter ending with
#purdue
+^http://www.purdue.ca/research/
+^http://www.purdue.ca/pdf/
# reject anything else
-.
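
A minimal sketch of how rules of this form are evaluated, assuming the behavior described in the comments of Nutch's stock regex-urlfilter.txt: the first rule whose pattern is found in the URL decides, `+` accepts, `-` rejects, and a URL that matches no rule is rejected. The names `parse_rules` and `accepts` are hypothetical helpers for illustration, not part of Nutch, and Python's `re.search` stands in for the Java regex matching.

```python
import re


def parse_rules(text):
    """Parse filter lines into (allow, compiled_pattern) pairs.

    Blank lines and lines starting with '#' are skipped; every other
    line must start with '+' (accept) or '-' (reject).
    """
    rules = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule whose pattern is found in url.

    Assumption: a URL that matches no rule at all is rejected.
    """
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


# The rules quoted in the message above, unchanged.
RULES = parse_rules("""
#purdue
+^http://www.purdue.ca/research/
+^http://www.purdue.ca/pdf/
# reject anything else
-.
""")

if __name__ == "__main__":
    for url in (
        "http://www.purdue.ca/research/research_clinical.asp",
        "http://www.purdue.ca/pdf/example.pdf",
        "http://www.purdue.ca/about/",
    ):
        print(url, accepts(RULES, url))
```

Note that the dots in the two `+` patterns are unescaped, so each one matches any character rather than a literal dot; writing `\.` would make the rules strict.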
The site is very small (which
[
https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12694942#action_12694942
]
Julien Nioche commented on NUTCH-692:
-
As I pointed out in my previous message the root
[
https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12694986#action_12694986
]
Doğacan Güney commented on NUTCH-721:
-
I've committed the Nutch 0.9 fetcher as OldFetcher.
[
https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12694986#action_12694986
]
Doğacan Güney edited comment on NUTCH-721 at 4/2/09 6:01 AM:
-
Hi @ all,
I'd like to turn Nutch into a focused / topical crawler. It's a part
of my final year thesis. Further, I'd like others to be able to benefit
from my work. I started to analyze the code and think that I found the
right piece of code. I just wanted to know if I am on the right
Hi @ all,
I'd like to turn Nutch into a focused / topical crawler. It's a
part of my final year thesis. Further, I'd like others to be able
to benefit from my work. I started to analyze the code and think
that I found the right piece of code. I just wanted to know if I am
on the right track.
George,
Try using Nutch-1.0 instead. I have tested your example with the SVN version
and it did not run into the problem you described.
J.
2009/4/2 George Herlin ghher...@gmail.com
Indeed I have... that's how I found out.
My test case: crawl
Hi all. I would like to add keywords to the information that gets inserted
into the Lucene Indexes. I am thinking I need to insert them into the WebDB
and later on insert them into the Lucene indexes. Am I right? Which
extension points do I need to use?
Thanks in advance
--
Rodrigo Reyes
[
https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Cosmin Lehene updated NUTCH-692:
Attachment: NUTCH-692.patch
This just checks the destination file existence before attempting to
[
https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695122#action_12695122
]
Doğacan Güney commented on NUTCH-692:
-
Thanks for the patch.
Patch looks good to me.
[
https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695170#action_12695170
]
Roger Dunk commented on NUTCH-721:
--
For the following tests I've used the same segment
[
https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695233#action_12695233
]
Hudson commented on NUTCH-721:
--
Integrated in Nutch-trunk #772 (See