[ https://issues.apache.org/jira/browse/NUTCH-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1282:
----------------------------------------

    Fix Version/s: 1.7
    
> linkdb scalability
> ------------------
>
>                 Key: NUTCH-1282
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1282
>             Project: Nutch
>          Issue Type: Improvement
>          Components: linkdb
>    Affects Versions: 1.4
>            Reporter: behnam nikbakht
>             Fix For: 1.7
>
>
> As described in NUTCH-1054, the linkdb is optional in solrindex; it is used
> only for anchor text and has no impact on scoring.
> In an incremental crawl the linkdb appears to grow very quickly, which makes
> it unscalable for very large web sites.
> So there are two choices: first, drop invertlinks and the linkdb from the
> crawl; second, make the linkdb scalable.
> invertlinks runs two jobs: the first builds a new linkdb from the newly
> parsed segments, and the second merges the new linkdb with the old linkdb.
> The second job is the unscalable one, and we can skip it with this change in
> solrindex:
> in the reduce method of the class IndexerMapReduce, if fetchDatum == null or
> dbDatum == null or parseText == null or parseData == null, add the anchors to
> the doc and update Solr (an update, not an insert).
> Some changes to NutchDocument are also required for this.
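
The branching proposed for the reduce method can be sketched in plain Java as below. This is only an illustration of the condition quoted above, not Nutch code: the class AnchorOnlyDecision, the Action enum, the decide helper and the anchorCount parameter are all hypothetical stand-ins, and skipping a record that has no anchors is an added assumption.

```java
// Sketch only: plain Java stand-ins, not the real Nutch classes or APIs.
public class AnchorOnlyDecision {

    // Hypothetical outcomes for one key seen by the reducer.
    enum Action { FULL_INDEX, ANCHOR_ONLY_UPDATE, SKIP }

    // Mirrors the condition from the issue: if any of the four values is
    // missing, only the anchors are available, so update the existing
    // document with them instead of inserting a full document.
    static Action decide(Object fetchDatum, Object dbDatum,
                         Object parseText, Object parseData,
                         int anchorCount) {
        boolean incomplete = fetchDatum == null || dbDatum == null
                || parseText == null || parseData == null;
        if (!incomplete) {
            return Action.FULL_INDEX;
        }
        // Assumption: with no anchors there is nothing to send.
        return anchorCount > 0 ? Action.ANCHOR_ONLY_UPDATE : Action.SKIP;
    }

    public static void main(String[] args) {
        Object present = new Object();
        System.out.println(decide(present, present, present, present, 2));
        System.out.println(decide(null, null, null, null, 2));
        System.out.println(decide(null, null, null, null, 0));
    }
}
```

With this shape, each run would only need the inlinks from the newly parsed segments, which is what would let the merge job be skipped.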

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
