[jira] [Commented] (NUTCH-1958) Remove scoring-opic from nutch-default.xml

Markus Jelsma (JIRA) Mon, 23 Mar 2015 05:20:47 -0700

    [ https://issues.apache.org/jira/browse/NUTCH-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375787#comment-14375787 ]


Markus Jelsma commented on NUTCH-1958:
--------------------------------------

Hello Julien - neither. Scoring-depth does not really assign a score to a 
document in the same sense as OPIC or webgraph. OPIC is flawed for any crawl 
where pages are going to get refetched, and enabling webgraph by default is 
perhaps a bit too much in terms of performance; it also does not automatically 
converge to a stable state (the number of cycles is predefined).

If you do a single crawl without refetching, i.e. get all pages of a domain, 
OPIC is not required. If you are going to crawl everything anyway, then 
prioritizing is useless.
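
A minimal sketch of that last point, assuming a toy frontier with made-up URLs 
and scores (the select_for_fetch helper is hypothetical and is not Nutch's 
generator code): once the fetch budget covers the whole frontier, the selected 
set is the same whatever the scores are, so scoring only matters when the 
budget is smaller than the frontier.

```python
def select_for_fetch(scores, top_n):
    """Return the top_n URLs by descending score, ties broken by URL."""
    ranked = sorted(scores, key=lambda url: (-scores[url], url))
    return ranked[:top_n]


# Hypothetical frontier: the same three URLs, scored two different ways.
scored = {"example.org/a": 3.0, "example.org/b": 1.0, "example.org/c": 2.0}
flat = {"example.org/a": 1.0, "example.org/b": 1.0, "example.org/c": 1.0}

# Budget smaller than the frontier: the scores decide which pages are fetched.
partial_scored = set(select_for_fetch(scored, 2))
partial_flat = set(select_for_fetch(flat, 2))

# Budget covering the whole frontier: every page is fetched either way.
full_scored = set(select_for_fetch(scored, len(scored)))
full_flat = set(select_for_fetch(flat, len(flat)))

print(partial_scored != partial_flat)
print(full_scored == full_flat)
```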

What do you think?



> Remove scoring-opic from nutch-default.xml
> ------------------------------------------
>
>                 Key: NUTCH-1958
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1958
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 2.3, 1.9
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 2.4, 1.10
>
>
> I propose we remove scoring-opic from nutch-default. We all know it is flawed 
> for any kind of incremental crawl, which most of us do. It is also useless if 
> you want to perform a single crawl: if you must crawl all records of a 
> domain, using OPIC for prioritizing URLs makes no sense. It also confuses 
> users, as we have seen in the past and recently [1].
> What do you think?
> [1]: 
> http://lucene.472066.n3.nabble.com/Nutch-documents-have-huge-scores-in-Solr-td4192064.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
