[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

ASF GitHub Bot (Jira) Fri, 22 Nov 2019 09:45:42 -0800


    [ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16980357#comment-16980357
 ]


ASF GitHub Bot commented on NUTCH-2184:
---------------------------------------

sebastian-nagel commented on pull request #486: NUTCH-2184 Enable IndexingJob 
to function with no crawldb
URL: https://github.com/apache/nutch/pull/486
 
 
   This PR obsoletes #95 (parts of the work are already done in [NUTCH-2456](https://issues.apache.org/jira/browse/NUTCH-2456)/#240). It
   - makes the CrawlDb argument passed to the indexing job optional
   - but does not otherwise change the behavior of the indexing job
   - if there are positional (non-option) arguments, the first of them is expected to be the CrawlDb unless `-nocrawldb` is given
   - picks up various improvements from PR #95
   - and improves the command-line help:
   ```
   Usage: Indexer (<crawldb> | -nocrawldb) (<segment> ... | -dir <segments>) [general options]
   
   Index given segments using configured indexer plugins
   
   The CrawlDb is optional but it is required to send deletion requests for duplicates
   and to read the proper document score/boost/weight passed to the indexers.
   
   Required arguments:
   
           <crawldb>       path to CrawlDb, or
           -nocrawldb      flag to indicate that no CrawlDb shall be used
   
           <segment> ...   path(s) to segment, or
           -dir <segments> path to segments/ directory,
                           (all subdirectories are read as segments)
   
   General options:
   
           -linkdb <linkdb>        use LinkDb to index anchor texts of incoming links
           -params k1=v1&k2=v2...  parameters passed to indexer plugins
                                   (via property indexer.additional.params)
   
           -noCommit       do not call the commit method of indexer plugins
           -deleteGone     send deletion requests for 404s, redirects, duplicates
           -filter         skip documents with URL rejected by configured URL filters
           -normalize      normalize URLs before indexing
           -addBinaryContent       index raw/binary content in field `binaryContent`
           -base64         use Base64 encoding for binary content
   ```
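To illustrate the argument rule described above, here is a minimal sketch. It is not the code in this PR: the class `IndexerArgsSketch`, the `Parsed` holder and its field names are hypothetical, and it assumes `-nocrawldb` may appear anywhere on the command line, which the real job may not allow. It only shows how the first positional argument could be taken as the CrawlDb unless `-nocrawldb` is given.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration only; this is NOT the IndexingJob code from the PR. */
final class IndexerArgsSketch {

  /** Holder for the parsed command line (all names here are assumptions). */
  static final class Parsed {
    String crawlDb;      // stays null when -nocrawldb is given
    String segmentsDir;  // value of -dir, or null
    String linkDb;       // value of -linkdb, or null
    String params;       // value of -params, or null
    final List<String> segments = new ArrayList<>();
    final Set<String> flags = new LinkedHashSet<>();
  }

  static Parsed parse(String[] args) {
    Parsed p = new Parsed();
    boolean noCrawlDb = false;
    List<String> positional = new ArrayList<>();
    for (int i = 0; i < args.length; i++) {
      String a = args[i];
      switch (a) {
        case "-nocrawldb":
          noCrawlDb = true;
          break;
        case "-dir":
          p.segmentsDir = value(args, ++i, a);
          break;
        case "-linkdb":
          p.linkDb = value(args, ++i, a);
          break;
        case "-params":
          p.params = value(args, ++i, a);
          break;
        case "-noCommit":
        case "-deleteGone":
        case "-filter":
        case "-normalize":
        case "-addBinaryContent":
        case "-base64":
          p.flags.add(a); // boolean switches from the usage text
          break;
        default:
          if (a.startsWith("-")) {
            throw new IllegalArgumentException("Unknown option: " + a);
          }
          positional.add(a); // crawldb or segment path, decided after the loop
      }
    }
    // The first positional argument is the CrawlDb unless -nocrawldb was given.
    if (!noCrawlDb) {
      if (positional.isEmpty()) {
        throw new IllegalArgumentException("Expected <crawldb> or -nocrawldb");
      }
      p.crawlDb = positional.remove(0);
    }
    // All remaining positional arguments are segment paths.
    p.segments.addAll(positional);
    if (p.segments.isEmpty() && p.segmentsDir == null) {
      throw new IllegalArgumentException("Expected <segment> ... or -dir <segments>");
    }
    return p;
  }

  private static String value(String[] args, int i, String opt) {
    if (i >= args.length) {
      throw new IllegalArgumentException("Missing value for " + opt);
    }
    return args[i];
  }
}
```

With this sketch, `parse(new String[] {"-nocrawldb", "-dir", "crawl/segments"})` leaves `crawlDb` unset and records the segments directory, while `parse(new String[] {"crawl/crawldb", "crawl/segments/1"})` treats the first path as the CrawlDb and the second as a segment. The paths are made-up examples.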
 


> Enable IndexingJob to function with no crawldb
> ----------------------------------------------
>
>                 Key: NUTCH-2184
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2184
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Major
>             Fix For: 1.17
>
>         Attachments: NUTCH-2184.patch, NUTCH-2184v2.patch
>
>
> Sometimes when working with distributed team(s), we have found that we can
> 'lose' data structures which are currently considered critical, e.g.
> crawldb, linkdb and/or segments.
> In my current scenario I have a requirement to index segment data with no 
> accompanying crawldb or linkdb. 
> Absence of the latter is OK, as linkdb is optional; however, currently in
> [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java]
>  crawldb is mandatory.
> This ticket should enhance the IndexerMapReduce code to support the use case 
> where you ONLY have segments and want to force an index for every record 
> present.



