RE: Host-level stats, ranking and recrawl

2007-09-18 Thread Vishal Shah
Hi Andrzej, This sounds like a good addition to the current system IMO. It would especially be helpful for building a generic web search or for building a domain-specific search where you would have an algorithm to prioritize which sites to crawl for your domain. I would go one step further

[jira] Created: (NUTCH-557) protocol-http11 for HTTP 1.1, HTTPS, NTLM, Basic and Digest Authentication

2007-09-18 Thread Susam Pal (JIRA)
protocol-http11 for HTTP 1.1, HTTPS, NTLM, Basic and Digest Authentication. Key: NUTCH-557; URL: https://issues.apache.org/jira/browse/NUTCH-557; Project: Nutch; Issue

[jira] Updated: (NUTCH-557) protocol-http11 for HTTP 1.1, HTTPS, NTLM, Basic and Digest Authentication

2007-09-18 Thread Susam Pal (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Susam Pal updated NUTCH-557: Attachment: protocol-http11v0.1.patch I have generated this patch against Nutch trunk. To apply:- patch

[jira] Resolved: (NUTCH-554) Generator throws java.io.IOException and dies on injected urls with no protocol

2007-09-18 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-554. Resolution: Fixed; Fix Version/s: 1.0.0; Assignee: Andrzej Bialecki; Patch

[jira] Updated: (NUTCH-557) protocol-http11 for HTTP 1.1, HTTPS, NTLM, Basic and Digest Authentication

2007-09-18 Thread Susam Pal (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Susam Pal updated NUTCH-557: Priority: Minor (was: Major) protocol-http11 for HTTP 1.1, HTTPS, NTLM, Basic and Digest Authentication

[jira] Closed: (NUTCH-554) Generator throws java.io.IOException and dies on injected urls with no protocol

2007-09-18 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-554. Generator throws java.io.IOException and dies on injected urls with no protocol

Re: Host-level stats, ranking and recrawl

2007-09-18 Thread Doğacan Güney
Hi, On 9/17/07, Andrzej Bialecki wrote: Hi, I was recently reading again some scoring-related papers, and found some interesting data in a paper by Baeza-Yates et al., "Crawling a Country: Better Strategies than Breadth-First for Web Page Ordering"

Re: Scoring API issues (LONG)

2007-09-18 Thread Doğacan Güney
Hi, I think the ideas here are brilliant. A big +1 from me. I have one minor suggestion that I detail below. On 9/13/07, Andrzej Bialecki wrote: Hi all, I've been working recently on a custom scoring plugin, and I found out some issues with the scoring API that severely

Re: Scoring API issues (LONG)

2007-09-18 Thread Andrzej Bialecki
Doğacan Güney wrote:
public void prepareInjectorConfig(Path crawlDb, Path urls, Configuration config);
public void prepareGeneratorConfig(Path crawlDb, Configuration config);
public void prepareIndexerConfig(Path crawlDb, Path linkDb, Path[] segments, Configuration config);
public void

[jira] Commented: (NUTCH-554) Generator throws java.io.IOException and dies on injected urls with no protocol

2007-09-18 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12528658 ] Hudson commented on NUTCH-554: Integrated in Nutch-Nightly #211 (See