[
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16980357#comment-16980357
]
ASF GitHub Bot commented on NUTCH-2184:
---------------------------------------
sebastian-nagel commented on pull request #486: NUTCH-2184 Enable IndexingJob
to function with no crawldb
URL: https://github.com/apache/nutch/pull/486
This PR obsoletes #95 (parts of the work are already done in
[NUTCH-2456](https://issues.apache.org/jira/browse/NUTCH-2456)/#240). It
- makes the CrawlDb argument passed to the indexing job optional
  - but does not change the behavior of the indexing job otherwise
  - if there are non-optional arguments, the first of them is expected to be
    the CrawlDb unless `-nocrawldb` is given
- picks various improvements from PR #95
- and improves the command-line help:
```
Usage: Indexer (<crawldb> | -nocrawldb) (<segment> ... | -dir <segments>) [general options]

Index given segments using configured indexer plugins

The CrawlDb is optional but it is required to send deletion requests for duplicates
and to read the proper document score/boost/weight passed to the indexers.

Required arguments:

    <crawldb>               path to CrawlDb, or
    -nocrawldb              flag to indicate that no CrawlDb shall be used
    <segment> ...           path(s) to segment, or
    -dir <segments>         path to segments/ directory,
                            (all subdirectories are read as segments)

General options:

    -linkdb <linkdb>        use LinkDb to index anchor texts of incoming links
    -params k1=v1&k2=v2...  parameters passed to indexer plugins
                            (via property indexer.additional.params)
    -noCommit               do not call the commit method of indexer plugins
    -deleteGone             send deletion requests for 404s, redirects, duplicates
    -filter                 skip documents with URL rejected by configured URL filters
    -normalize              normalize URLs before indexing
    -addBinaryContent       index raw/binary content in field `binaryContent`
    -base64                 use Base64 encoding for binary content
```
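The rule for positional arguments described above (the first one is taken as the CrawlDb unless `-nocrawldb` is given) can be illustrated with a minimal sketch. This is not the code from the PR: the class `IndexerArgsSketch`, its `Parsed` holder, and the error messages are hypothetical names chosen for illustration, and only `-nocrawldb` and `-dir` are handled, with the general options left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the argument rule; NOT the actual IndexingJob code. */
class IndexerArgsSketch {

  /** Parsed command line; crawlDb stays null when -nocrawldb was given. */
  static final class Parsed {
    boolean noCrawlDb;
    String crawlDb;
    String segmentsDir;
    final List<String> segments = new ArrayList<>();
  }

  static Parsed parse(String[] args) {
    Parsed p = new Parsed();
    List<String> positional = new ArrayList<>();

    // First pass: separate flags from positional (path) arguments.
    for (int i = 0; i < args.length; i++) {
      String a = args[i];
      if (a.equals("-nocrawldb")) {
        p.noCrawlDb = true;
      } else if (a.equals("-dir")) {
        if (++i >= args.length) {
          throw new IllegalArgumentException("-dir requires a path");
        }
        p.segmentsDir = args[i];
      } else if (a.startsWith("-")) {
        // General options (-linkdb, -params, ...) are omitted in this sketch.
        throw new IllegalArgumentException("not handled in this sketch: " + a);
      } else {
        positional.add(a);
      }
    }

    // Second pass: the first positional argument is the CrawlDb
    // unless -nocrawldb was given; all remaining ones are segments.
    if (!p.noCrawlDb) {
      if (positional.isEmpty()) {
        throw new IllegalArgumentException("need <crawldb> or -nocrawldb");
      }
      p.crawlDb = positional.remove(0);
    }
    p.segments.addAll(positional);

    // Assumption: at least one segment source is required, as in the usage text.
    if (p.segments.isEmpty() && p.segmentsDir == null) {
      throw new IllegalArgumentException("need <segment> ... or -dir <segments>");
    }
    return p;
  }

  public static void main(String[] args) {
    // Example paths are made up for illustration.
    Parsed p = parse(new String[] {"-nocrawldb", "-dir", "crawl/segments"});
    System.out.println("crawlDb=" + p.crawlDb + " segmentsDir=" + p.segmentsDir);
  }
}
```

Collecting the positional arguments first and assigning the CrawlDb afterwards keeps the result independent of where `-nocrawldb` appears on the command line. Whether the real job accepts the flag in any position is an assumption of this sketch.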
> Enable IndexingJob to function with no crawldb
> ----------------------------------------------
>
> Key: NUTCH-2184
> URL: https://issues.apache.org/jira/browse/NUTCH-2184
> Project: Nutch
> Issue Type: Improvement
> Components: indexer
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Major
> Fix For: 1.17
>
> Attachments: NUTCH-2184.patch, NUTCH-2184v2.patch
>
>
> Sometimes when working with distributed team(s), we have found that we can
> 'lose' data structures which are currently considered critical, e.g.
> crawldb, linkdb and/or segments.
> In my current scenario I have a requirement to index segment data with no
> accompanying crawldb or linkdb.
> Absence of the latter is OK as linkdb is optional; however, currently in
> [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java]
> the crawldb is mandatory.
> This ticket should enhance the IndexerMapReduce code to support the use case
> where you ONLY have segments and want to force an index for every record
> present.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)