[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

Lewis John McGibbney (JIRA) Tue, 29 Dec 2015 15:58:27 -0800

    [ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15074453#comment-15074453 ]


Lewis John McGibbney commented on NUTCH-2184:
---------------------------------------------

[~markus17] coming back to this one briefly, I've thought about the points 
you've raised and want to respond as follows. 
  1. The proposed patch does not change core indexing functionality as such; 
it instead extends (improves???) it to permit indexing of just segments. 
  2. If you are in one of the scenarios you've highlighted (e.g. possible 
configurations for index.*.md and db.parsemeta.to.crawldb), then AFAICT 
nothing changes: if you have the crawldb then those settings are used, if not 
then they are not. If I am wrong here, can you point me to the code I need to 
look at? I didn't originally put the index.*.md and db.parsemeta.to.crawldb 
functionality in place, so a bit of guidance would be nice here.
  3. Finally, to address the following
bq. Also, what is going to happen to transient errors? Records with 
FETCH_STATUS_RETRY should be ignored.

On the final one I am not sure right now. I will revisit this patch after my 
vacation (2nd week of January). If you can provide feedback on point 2 above, 
I'll hammer my way through that first. 
Thanks [~markus17]

> Enable IndexingJob to function with no crawldb
> ----------------------------------------------
>
>                 Key: NUTCH-2184
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2184
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 1.12
>
>         Attachments: NUTCH-2184.patch
>
>
> Sometimes when working with distributed team(s), we have found that we can 
> 'lose' data structures which are currently considered as critical e.g. 
> crawldb, linkdb and/or segments.
> In my current scenario I have a requirement to index segment data with no 
> accompanying crawldb or linkdb. 
> Absence of the latter is OK as linkdb is optional; however, currently in 
> [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java]
>  crawldb is mandatory. 
> This ticket should enhance the IndexerMapReduce code to support the use case 
> where you ONLY have segments and want to force an index for every record 
> present.
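
As a rough illustration of the use case described above, here is a minimal, 
self-contained Java sketch. It is not the attached patch and does not use the 
Nutch API: the class name, the FetchStatus enum and both helper methods are 
hypothetical stand-ins. It only shows the intended behaviour, i.e. segments 
are always job input, crawldb and linkdb are added only when supplied, and 
records whose fetch ended in a transient (retry) state are skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch; none of these names exist in Nutch.
class SegmentOnlyIndexingSketch {

    // Stand-in for a fetch status; NOT the Nutch CrawlDatum constants.
    enum FetchStatus { SUCCESS, RETRY }

    // Build the list of job inputs: segments are always included,
    // crawldb and linkdb only when the caller actually has them.
    static List<String> inputPaths(Optional<String> crawlDb,
                                   Optional<String> linkDb,
                                   List<String> segments) {
        List<String> inputs = new ArrayList<>(segments);
        crawlDb.ifPresent(inputs::add);
        linkDb.ifPresent(inputs::add);
        return inputs;
    }

    // Force an index for every record except those left in a
    // transient (retry) fetch state.
    static boolean shouldIndex(FetchStatus status) {
        return status != FetchStatus.RETRY;
    }
}
```

With no crawldb or linkdb, inputPaths(Optional.empty(), Optional.empty(), 
segments) returns just the segment paths, so nothing downstream has to assume 
a crawldb entry exists.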



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
