[
https://issues.apache.org/jira/browse/NUTCH-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221598#comment-13221598
]
Markus Jelsma commented on NUTCH-1282:
--------------------------------------
There is an issue for that. In my opinion, with that issue implemented, the
current linkdb can be deprecated. Please check NUTCH-1181 if you have a patch
for this.
> linkdb scalability
> ------------------
>
> Key: NUTCH-1282
> URL: https://issues.apache.org/jira/browse/NUTCH-1282
> Project: Nutch
> Issue Type: Improvement
> Components: linkdb
> Affects Versions: 1.4
> Reporter: behnam nikbakht
>
> As described in NUTCH-1054, the linkdb is optional in solrindex; it is
> used only for anchors and has no impact on scoring.
> The linkdb appears to grow very fast during incremental crawls, which makes
> it unscalable for very large web sites.
> So there are two choices: first, drop invertlinks and the linkdb from the
> crawl, or second, make the linkdb scalable.
> invertlinks runs two jobs: the first constructs a new linkdb from the newly
> parsed segments, and the second merges the new linkdb with the old linkdb.
> The second job is the unscalable one, and we can skip it with these changes
> in solrindex:
> in the reduce method of the class IndexerMapReduce, if fetchDatum == null or
> dbDatum == null or parseText == null or parseData == null, then add the
> anchors to the doc and update Solr (no insert).
> Some changes to NutchDocument are also required.
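
The branching proposed in the description above could be sketched roughly as
follows. This is only an illustration of the decision logic, not Nutch code:
the class name, the Action enum and the decide() method are hypothetical, the
four datum parameters are plain Objects standing in for the real Nutch types,
and the SKIP case for records with no anchors is an assumption that the
description does not state.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the decision the issue proposes for the
// reduce method of IndexerMapReduce. Nothing here is an actual Nutch API.
class AnchorOnlyUpdateSketch {

    // What the reducer would do with the record for one URL.
    enum Action { FULL_INDEX, ANCHOR_ONLY_UPDATE, SKIP }

    // The four datum arguments mirror the names used in the issue text.
    // A null value means this reduce call did not see that part of the record.
    static Action decide(Object fetchDatum, Object dbDatum,
                         Object parseText, Object parseData,
                         List<String> anchors) {
        boolean incomplete = fetchDatum == null || dbDatum == null
                || parseText == null || parseData == null;
        if (!incomplete) {
            // All parts present: index the full document as usual.
            return Action.FULL_INDEX;
        }
        // Assumption: with no anchors there is nothing to send to Solr.
        if (anchors == null || anchors.isEmpty()) {
            return Action.SKIP;
        }
        // Proposed behaviour: add the anchors and update the existing
        // Solr document instead of inserting a new one.
        return Action.ANCHOR_ONLY_UPDATE;
    }

    public static void main(String[] args) {
        Object present = new Object();
        System.out.println(decide(present, present, present, present,
                Arrays.asList("home")));
        System.out.println(decide(null, present, present, present,
                Arrays.asList("home")));
        System.out.println(decide(null, null, null, null,
                Collections.<String>emptyList()));
    }
}
```

In a real patch the ANCHOR_ONLY_UPDATE branch would also need the
NutchDocument changes mentioned in the description, which this sketch does not
attempt to model.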