[jira] [Updated] (NUTCH-1300) Indexer to normalize URL's

2012-03-07 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1300:
-

Attachment: NUTCH-1300-1.5-1.patch

Patch for 1.5.

> Indexer to normalize URL's
> --
>
> Key: NUTCH-1300
> URL: https://issues.apache.org/jira/browse/NUTCH-1300
> Project: Nutch
> Issue Type: New Feature
> Components: indexer
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.5
>
> Attachments: NUTCH-1300-1.5-1.patch
>
>
> Indexers should be able to normalize URLs. This is useful when a new 
> normalizer is applied to the entire CrawlDB. Without it, some or all records 
> in a segment cannot be indexed at all.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (NUTCH-1303) Fetcher to skip queues for URLS getting repeated exceptions, based on percentage

2012-03-07 Thread behnam nikbakht (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

behnam nikbakht updated NUTCH-1303:
---

Attachment: NUTCH-1303.patch

> Fetcher to skip queues for URLS getting repeated exceptions, based on 
> percentage
> 
>
> Key: NUTCH-1303
> URL: https://issues.apache.org/jira/browse/NUTCH-1303
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Affects Versions: 1.4
> Reporter: behnam nikbakht
> Labels: fetch
> Attachments: NUTCH-1303.patch
>
>
> As described in https://issues.apache.org/jira/browse/NUTCH-769, skipping 
> queues with a high exception count is a good solution, but it is not easy to 
> choose a value for fetcher.max.exceptions.per.queue when queue sizes differ.
> I suggest defining a ratio instead of an absolute value, so that a queue is 
> cleared when its ratio of exceptions to requests exceeds the configured limit.
> Limiting exceptions is also not sufficient on its own: 
> fetcher.throughput.threshold.pages ensures that a worthwhile fetch throughput 
> is maintained against slow hosts, but it clears all queues, not only the slow 
> one. I suggest that this threshold, like fetcher.max.exceptions.per.queue, be 
> enforced per queue rather than across all of them.
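
The ratio-based rule proposed above could be sketched roughly as follows. This is only an illustration of the idea and is not taken from NUTCH-1303.patch: the class RatioQueue, the maxExceptionRatio and minRequests parameters, and the record method are all hypothetical names, not part of Nutch's fetcher. The minRequests guard is an added assumption so that a single early failure does not clear a queue.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical per-host fetch queue; not Nutch's actual fetcher queue class. */
class RatioQueue {
    private final Queue<String> urls = new ArrayDeque<>();
    private final double maxExceptionRatio; // hypothetical setting, e.g. 0.5
    private final int minRequests;          // assumed guard against tiny samples
    private int requests;
    private int exceptions;

    RatioQueue(double maxExceptionRatio, int minRequests) {
        this.maxExceptionRatio = maxExceptionRatio;
        this.minRequests = minRequests;
    }

    void add(String url) {
        urls.add(url);
    }

    int size() {
        return urls.size();
    }

    /**
     * Records the outcome of one request against this queue. If enough
     * requests have been seen and the exceptions/requests ratio exceeds
     * the limit, the remaining URLs are dropped.
     *
     * @return the number of URLs dropped (0 if the queue was not cleared)
     */
    int record(boolean failed) {
        requests++;
        if (failed) {
            exceptions++;
        }
        boolean enoughSamples = requests >= minRequests;
        double ratio = (double) exceptions / requests;
        if (enoughSamples && ratio > maxExceptionRatio) {
            int dropped = urls.size();
            urls.clear();
            return dropped;
        }
        return 0;
    }
}
```

Because the check is relative to the number of requests, the same limit behaves comparably for small and large queues, which is the difficulty the description raises with a fixed fetcher.max.exceptions.per.queue value.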
