[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15074453#comment-15074453 ]
Lewis John McGibbney commented on NUTCH-2184:
---------------------------------------------
[~markus17] coming back to this one briefly: I've thought about the points
you've raised and want to respond as follows.
1. This proposed patch does not change core indexing functionality as such;
instead it extends (improves???) it to permit indexing of just segments.
2. If you experience the scenarios you've highlighted (e.g. possible
configurations for index.*.md and db.parsemeta.to.crawldb), then AFAICT
nothing changes: if you have the crawldb then they are used, if not then they
are not. If I am wrong here, can you point me to the code that I need to have a
look at? I didn't originally put the index.*.md and db.parsemeta.to.crawldb
functionality in place, so a bit of guidance would be nice here.
3. Finally, to address the following
bq. Also, what is going to happen to transient errors? Records with
FETCH_STATUS_RETRY should be ignored.
On the final one I am not sure right now. I will revisit this patch after my
vacation (2nd week of January). If you can provide feedback on 2 above then I'll
hammer my way through that first.
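For illustration, the behaviour discussed in points 2 and 3 can be sketched as
below. This is a minimal, self-contained sketch and not the attached patch: it
does not use the Nutch API, and the Record class, the FetchStatus enum and the
shouldIndex/describe methods are all hypothetical names. The retry check
reflects the suggestion quoted in point 3, not confirmed behaviour of the patch.

```java
import java.util.Optional;

public class SegmentOnlyIndexingSketch {

    // Hypothetical fetch outcomes; these are NOT Nutch's CrawlDatum constants.
    enum FetchStatus { SUCCESS, RETRY, GONE }

    // Hypothetical view of what is known about one URL at indexing time.
    static final class Record {
        final FetchStatus fetchStatus;          // taken from the segment
        final boolean hasParseData;             // taken from the segment
        final Optional<String> crawlDbMetadata; // empty when no crawldb was supplied

        Record(FetchStatus fetchStatus, boolean hasParseData,
               Optional<String> crawlDbMetadata) {
            this.fetchStatus = fetchStatus;
            this.hasParseData = hasParseData;
            this.crawlDbMetadata = crawlDbMetadata;
        }
    }

    // The decision depends only on segment data, so a missing crawldb does not
    // block indexing. Transient failures (RETRY) are skipped.
    static boolean shouldIndex(Record r) {
        return r.hasParseData && r.fetchStatus == FetchStatus.SUCCESS;
    }

    // crawldb metadata is used when present and silently left out otherwise.
    static String describe(Record r) {
        if (!shouldIndex(r)) {
            return "skip";
        }
        return r.crawlDbMetadata
                .map(md -> "index with crawldb metadata: " + md)
                .orElse("index from segment only");
    }

    public static void main(String[] args) {
        System.out.println(describe(
                new Record(FetchStatus.SUCCESS, true, Optional.empty())));
        System.out.println(describe(
                new Record(FetchStatus.SUCCESS, true, Optional.of("score=1.0"))));
        System.out.println(describe(
                new Record(FetchStatus.RETRY, true, Optional.empty())));
    }
}
```

The point of the sketch is only that the crawldb lookup becomes optional while
the per-record checks on segment data stay the same.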
Thanks [~markus17]
> Enable IndexingJob to function with no crawldb
> ----------------------------------------------
>
> Key: NUTCH-2184
> URL: https://issues.apache.org/jira/browse/NUTCH-2184
> Project: Nutch
> Issue Type: Improvement
> Components: indexer
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 1.12
>
> Attachments: NUTCH-2184.patch
>
>
> Sometimes when working with distributed team(s), we have found that we can
> 'lose' data structures which are currently considered critical, e.g.
> crawldb, linkdb and/or segments.
> In my current scenario I have a requirement to index segment data with no
> accompanying crawldb or linkdb.
> Absence of the latter is OK as linkdb is optional; however, currently in
> [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java]
> crawldb is mandatory.
> This ticket should enhance the IndexerMapReduce code to support the use case
> where you ONLY have segments and want to force an index for every record
> present.