linkdb scalability
------------------
Key: NUTCH-1282
URL: https://issues.apache.org/jira/browse/NUTCH-1282
Project: Nutch
Issue Type: Improvement
Components: linkdb
Affects Versions: 1.4
Reporter: behnam nikbakht
As described in NUTCH-1054, the linkdb is optional in solrindex; it is used
only for anchors and has no impact on scoring.
It seems that the linkdb grows very fast during incremental crawls, which makes
it unscalable for very large web sites.
So there are two choices: first, drop invertlinks and the linkdb from the
crawl; second, make the linkdb scalable.
invertlinks runs 2 jobs: the first constructs a new linkdb from the newly
parsed segments, and the second merges the new linkdb with the old linkdb. The
second job is the unscalable one, and we can skip it with these changes in
solrindex:
In the reduce method of the class IndexerMapReduce, if fetchDatum == null or
dbDatum == null or parseText == null or parseData == null, then add the anchors
to the doc and update the existing document in Solr (no insert).
This also requires some changes to NutchDocument.
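
The proposed reduce-side decision could be sketched roughly as below. This is
only an illustration of the idea, not the actual IndexerMapReduce code: the
UrlRecord class, the Action enum, and the "id", "content", and "anchor" field
names are hypothetical stand-ins, and skipping a URL that has neither complete
data nor anchors is an assumption that the description above does not cover.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AnchorOnlyIndexingSketch {

    /** What the indexer would do for one URL (hypothetical, for illustration). */
    enum Action { FULL_INSERT, ANCHOR_UPDATE, SKIP }

    /** Hypothetical stand-in for the values the reduce method collects per URL. */
    static final class UrlRecord {
        final Object fetchDatum;
        final Object dbDatum;
        final String parseText;
        final Object parseData;
        final List<String> anchors;

        UrlRecord(Object fetchDatum, Object dbDatum, String parseText,
                  Object parseData, List<String> anchors) {
            this.fetchDatum = fetchDatum;
            this.dbDatum = dbDatum;
            this.parseText = parseText;
            this.parseData = parseData;
            this.anchors = new ArrayList<>(anchors);
        }
    }

    /** Chooses between a full insert and an anchor-only update. */
    static Action decide(UrlRecord r) {
        boolean incomplete = r.fetchDatum == null || r.dbDatum == null
                || r.parseText == null || r.parseData == null;
        if (!incomplete) {
            return Action.FULL_INSERT;
        }
        // Assumption: with nothing to add, the URL is skipped.
        return r.anchors.isEmpty() ? Action.SKIP : Action.ANCHOR_UPDATE;
    }

    /** Builds the fields that would be sent to Solr for the chosen action. */
    static Map<String, Object> buildDocument(String url, UrlRecord r) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", url);
        if (decide(r) == Action.FULL_INSERT) {
            doc.put("content", r.parseText);
        }
        if (!r.anchors.isEmpty()) {
            doc.put("anchor", r.anchors);
        }
        return doc;
    }

    public static void main(String[] args) {
        // A URL seen only through new inlinks: no fetch, db, or parse data.
        UrlRecord partial = new UrlRecord(null, null, null, null,
                Arrays.asList("Nutch home"));
        System.out.println(decide(partial) + " "
                + buildDocument("http://example.org/", partial));
    }
}
```

With this split, a URL that only picked up new inlinks in the current segments
would get an anchor-only update, so the merge with the old linkdb would not be
needed to keep the anchor field current.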