[ https://issues.apache.org/jira/browse/NUTCH-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lewis John McGibbney updated NUTCH-1282:
----------------------------------------
Fix Version/s: 1.7
> linkdb scalability
> ------------------
>
> Key: NUTCH-1282
> URL: https://issues.apache.org/jira/browse/NUTCH-1282
> Project: Nutch
> Issue Type: Improvement
> Components: linkdb
> Affects Versions: 1.4
> Reporter: behnam nikbakht
> Fix For: 1.7
>
>
> As described in NUTCH-1054, the linkdb is optional in solrindex; it is used
> only for anchor text and has no impact on scoring.
> It appears that the linkdb grows very fast during incremental crawls, which
> makes it unscalable for very large web sites.
> So there are two choices: first, drop invertlinks and the linkdb from the
> crawl; second, make the linkdb scalable.
> invertlinks runs two jobs: the first constructs a new linkdb from the newly
> parsed segments, and the second merges the new linkdb with the old linkdb. The
> second job is the unscalable one, and we can skip it with this change to
> solrindex:
> in the reduce method of the class IndexerMapReduce, if fetchDatum == null or
> dbDatum == null or parseText == null or parseData == null, then add the anchors
> to the doc and update Solr (no insert).
> Some changes to NutchDocument are also required here.
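The branching proposed for the reduce method can be sketched roughly as below. This is only an illustration of the decision, not Nutch code: the class AnchorOnlyIndexingSketch, the Action enum and the decide method are hypothetical names, the four booleans stand in for the null checks on fetchDatum, dbDatum, parseText and parseData, and skipping a URL that has no anchors at all is an added assumption that the issue does not state.

```java
// Hypothetical sketch of the proposed decision; nothing here is part of Nutch.
final class AnchorOnlyIndexingSketch {

    /** What the reducer would do for a single URL. */
    enum Action { SKIP, FULL_INSERT, ANCHOR_UPDATE }

    /**
     * Each boolean stands in for a "!= null" check on the matching value
     * collected in the reduce method (fetchDatum, dbDatum, parseText, parseData).
     * anchorCount is the number of inlink anchors seen for the URL.
     */
    static Action decide(boolean hasFetchDatum, boolean hasDbDatum,
                         boolean hasParseText, boolean hasParseData,
                         int anchorCount) {
        boolean complete = hasFetchDatum && hasDbDatum && hasParseText && hasParseData;
        if (complete) {
            // All parts are present: index the whole document as before.
            return Action.FULL_INSERT;
        }
        if (anchorCount > 0) {
            // Something is missing: only add the anchors to the existing
            // document and send an update to Solr instead of an insert.
            return Action.ANCHOR_UPDATE;
        }
        // Assumption: with nothing to add, the URL is skipped.
        return Action.SKIP;
    }
}
```

With this shape, a URL that only receives new inlinks from the newly parsed segments would yield ANCHOR_UPDATE, so its anchors could reach the index without the merge against the old linkdb.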