[jira] [Commented] (NUTCH-1282) linkdb scalability

Markus Jelsma (Commented) (JIRA) Sat, 03 Mar 2012 06:48:24 -0800

    [ 
https://issues.apache.org/jira/browse/NUTCH-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221598#comment-13221598
 ]


Markus Jelsma commented on NUTCH-1282:
--------------------------------------

There is already an issue for that. In my opinion, once that issue is 
implemented, the current linkdb can be deprecated. Please check NUTCH-1181 if 
you have a patch for this.
                
> linkdb scalability
> ------------------
>
>                 Key: NUTCH-1282
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1282
>             Project: Nutch
>          Issue Type: Improvement
>          Components: linkdb
>    Affects Versions: 1.4
>            Reporter: behnam nikbakht
>
> As described in NUTCH-1054, the linkdb is optional in solrindex; it is used 
> only for anchors and has no impact on scoring.
> It appears that the linkdb grows very fast during incremental crawls, which 
> makes it unscalable for very large web sites.
> So there are two choices: first, drop invertlinks and the linkdb from the 
> crawl; second, make it scalable.
> invertlinks runs two jobs: the first constructs a new linkdb from the newly 
> parsed segments, and the second merges the new linkdb with the old linkdb. 
> The second job is the unscalable one, and we can skip it with this change in 
> solrindex:
> in the reduce method of the class IndexerMapReduce, if fetchDatum == null or 
> dbDatum == null or parseText == null or parseData == null, then add the 
> anchor to the doc and update Solr (no insert).
> Some changes to NutchDocument are also required for this.
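
The reduce-side change proposed in the description above can be sketched 
roughly as follows. This is only an illustration of the branching logic, not 
Nutch code: the class AnchorOnlyUpdateSketch, the UrlRecord holder and the 
Action enum are hypothetical stand-ins, and the SKIP branch for records 
without anchors is an assumption that the description does not state.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch only: simplified, hypothetical stand-ins for what the reduce
 * method of IndexerMapReduce sees for a single URL. None of these types
 * are the real Nutch classes.
 */
class AnchorOnlyUpdateSketch {

    /** Possible outcomes for one URL in the reducer. */
    enum Action { FULL_INDEX, ANCHOR_ONLY_UPDATE, SKIP }

    /** Hypothetical holder for the values grouped under one URL. */
    static final class UrlRecord {
        Object fetchDatum;
        Object dbDatum;
        String parseText;
        Object parseData;
        List<String> anchors = new ArrayList<>();
    }

    /**
     * Decide what to send to Solr. When any of the four parts is missing,
     * the proposal is to update the existing document with anchors only
     * instead of inserting a new document.
     */
    static Action decide(UrlRecord r) {
        boolean incomplete = r.fetchDatum == null
                || r.dbDatum == null
                || r.parseText == null
                || r.parseData == null;
        if (!incomplete) {
            return Action.FULL_INDEX;
        }
        // Assumption: with no anchors there is nothing to update.
        return r.anchors.isEmpty() ? Action.SKIP : Action.ANCHOR_ONLY_UPDATE;
    }
}
```

With this shape, a record that carries only inlink anchors from the newly 
parsed segments would lead to an update of the existing Solr document, which 
is what would let the merge job in invertlinks be skipped.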

