Re: OPIC scoring differences

2007-07-09 Thread Doğacan Güney
Hi, On 7/9/07, Carl Cerecke [EMAIL PROTECTED] wrote: Hi, The docs for the OPICScoringFilter mention that the plugin implements a variant of OPIC from Abiteboul et al's paper. What exactly is different? How does the difference affect the scores? Also, there's a comment in the code: // XXX (ab)

[jira] Commented: (NUTCH-509) Update Crawldb: avoid to start a job if there is no valid segment

2007-07-09 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511036 ] Doğacan Güney commented on NUTCH-509: - We should start a job even if there aren't any valid segments. One may

[jira] Commented: (NUTCH-509) Update Crawldb: avoid to start a job if there is no valid segment

2007-07-09 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511038 ] Emmanuel Joke commented on NUTCH-509: - You're right. In this case, I will close the JIRA Update Crawldb: avoid

[jira] Closed: (NUTCH-509) Update Crawldb: avoid to start a job if there is no valid segment

2007-07-09 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke closed NUTCH-509. --- Resolution: Won't Fix As explained by Doğacan, the Crawldb update already behaves correctly. This patch is

[jira] Closed: (NUTCH-507) lib-lucene-analyzers jar defintion is wrong in plugin.xml

2007-07-09 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney closed NUTCH-507. --- Issue resolved and committed. lib-lucene-analyzers jar defintion is wrong in plugin.xml

[jira] Created: (NUTCH-510) IndexMerger delete working dir

2007-07-09 Thread Enis Soztutar (JIRA)
IndexMerger delete working dir -- Key: NUTCH-510 URL: https://issues.apache.org/jira/browse/NUTCH-510 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 1.0.0

[jira] Resolved: (NUTCH-503) Generator exits incorrectly for small fetchlists

2007-07-09 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney resolved NUTCH-503. - Resolution: Fixed Fix Version/s: (was: 0.8.2) 1.0.0

[jira] Updated: (NUTCH-510) IndexMerger delete working dir

2007-07-09 Thread Enis Soztutar (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enis Soztutar updated NUTCH-510: Attachment: index.merger.delete.temp.dirs.patch Attached patch deletes working dirs on finally
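The patch description above mentions deleting working directories "on finally". The patch itself is not reproduced in this listing, so the following is only a minimal sketch of that general pattern, not the attached code. The class name WorkingDirCleanupSketch, the MergeStep interface, and the temp-directory prefix are hypothetical and are not part of Nutch's IndexMerger.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class WorkingDirCleanupSketch {

    /** Hypothetical stand-in for the merge work; not the IndexMerger API. */
    interface MergeStep {
        void run(Path workDir) throws IOException;
    }

    /**
     * Creates a working directory, runs the step, and always removes the
     * directory afterwards, whether the step succeeded or threw.
     * Returns the (now deleted) path so callers can inspect it.
     */
    static Path mergeWithCleanup(MergeStep step) throws IOException {
        Path workDir = Files.createTempDirectory("indexmerge-sketch");
        try {
            step.run(workDir);
        } finally {
            // Runs on both the normal and the exceptional path.
            deleteRecursively(workDir);
        }
        return workDir;
    }

    /** Deletes a directory tree, children before parents. */
    static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        List<Path> deepestFirst;
        try (Stream<Path> paths = Files.walk(root)) {
            deepestFirst = paths.sorted(Comparator.reverseOrder())
                                .collect(Collectors.toList());
        }
        for (Path p : deepestFirst) {
            Files.delete(p);
        }
    }

    public static void main(String[] args) throws IOException {
        Path used = mergeWithCleanup(
            dir -> Files.createFile(dir.resolve("part-00000")));
        System.out.println("working dir still exists: " + Files.exists(used));
    }
}
```

Putting the delete in the finally clause means a failed merge does not leave a stale working directory behind for the next run.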

spam detect

2007-07-09 Thread anton
Hello! Does Nutch have any modules for spam detection? Does anyone know where I can find any information (blogs, articles, FAQ) about it?

Re: OPIC scoring differences

2007-07-09 Thread Andrzej Bialecki
Carl Cerecke wrote: Hi, The docs for the OPICScoringFilter mention that the plugin implements a variant of OPIC from Abiteboul et al's paper. What exactly is different? How does the difference affect the scores? As it is now, the implementation doesn't preserve the total cash value in the

[jira] Issue Comment Edited: (NUTCH-510) IndexMerger delete working dir

2007-07-09 Thread Enis Soztutar (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511043 ] Enis Soztutar edited comment on NUTCH-510 at 7/9/07 5:32 AM: - Attached patch deletes

[jira] Commented: (NUTCH-508) ${hadoop.log.dir} and ${hadoop.log.file} are not propagated to the tasktracker

2007-07-09 Thread Enis Soztutar (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511121 ] Enis Soztutar commented on NUTCH-508: - Tasktracker invokes another jvm calling TaskTracker$Child but

Re: URL Injection with another source than text files

2007-07-09 Thread Epo Jemba
Hello guys, perhaps I'm on the wrong mailing list. Maybe someone can help me with my needs? Thank you 2007/7/4, Epo Jemba [EMAIL PROTECTED]: Hello, I'm new to Nutch and I have a question regarding the URL injection mechanism. If I understood correctly, the source of the actual urls injection

Not renewing CrawlDatum on Inject

2007-07-09 Thread Robert Young
I have been trying to get to grips with org.apache.nutch.crawl.Injector to help with a requirement I have for the project I'm working on and I'm a little confused about one place. On lines 120 - 121 any existing CrawlDatum is used instead of the newly injected one. This doesn't seem to make sense

Re: Not renewing CrawlDatum on Inject

2007-07-09 Thread Andrzej Bialecki
Robert Young wrote: I have been trying to get to grips with org.apache.nutch.crawl.Injector to help with a requirement I have for the project I'm working on and I'm a little confused about one place. On lines 120 - 121 any existing CrawlDatum is used instead of the newly injected one. This
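The behaviour described in this thread (an existing entry wins over a newly injected one for the same URL) can be illustrated with a small sketch. This is not the Injector source and does not use Nutch's CrawlDatum class: the Datum type, the Status enum, and the reduce method below are hypothetical names chosen for illustration, and the sketch assumes each URL has at most one existing and one injected entry.

```java
import java.util.Arrays;
import java.util.List;

public class InjectMergeSketch {

    /** Hypothetical statuses; the real CrawlDatum status codes are not shown here. */
    enum Status { INJECTED, DB_UNFETCHED, DB_FETCHED }

    /** Hypothetical, simplified stand-in for a crawl database entry. */
    static final class Datum {
        final Status status;
        final float score;

        Datum(Status status, float score) {
            this.status = status;
            this.score = score;
        }
    }

    /**
     * Merges all entries seen for one URL. If the database already holds
     * an entry, that entry is kept and the injected one is discarded;
     * otherwise the injected entry is used.
     */
    static Datum reduce(List<Datum> valuesForUrl) {
        Datum existing = null;
        Datum injected = null;
        for (Datum d : valuesForUrl) {
            if (d.status == Status.INJECTED) {
                injected = d;
            } else {
                existing = d;
            }
        }
        return existing != null ? existing : injected;
    }

    public static void main(String[] args) {
        Datum old = new Datum(Status.DB_FETCHED, 2.5f);
        Datum fresh = new Datum(Status.INJECTED, 1.0f);
        Datum kept = reduce(Arrays.asList(fresh, old));
        System.out.println("kept existing entry: " + (kept == old));
    }
}
```

Under this rule, re-injecting a known URL does not reset whatever state the database entry has accumulated, which is the effect the question above is asking about.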

[jira] Commented: (NUTCH-503) Generator exits incorrectly for small fetchlists

2007-07-09 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511330 ] Hudson commented on NUTCH-503: -- Integrated in Nutch-Nightly #145 (See

[jira] Commented: (NUTCH-507) lib-lucene-analyzers jar defintion is wrong in plugin.xml

2007-07-09 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511329 ] Hudson commented on NUTCH-507: -- Integrated in Nutch-Nightly #145 (See