[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

Lewis John McGibbney (JIRA) Mon, 14 Dec 2015 12:50:50 -0800

    [ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056690#comment-15056690 ]


Lewis John McGibbney commented on NUTCH-2184:
---------------------------------------------

This issue also improves command-line parsing for the IndexingJob tool. When 
invoked without arguments, it now prints the following help:
{code}
lmcgibbn@LMC-032857 /usr/local/trunk_new(joshua) $ ./runtime/local/bin/nutch index
Failed to parse command line Did not see expected # of arguments, saw 0
usage: IndexingJob [-crawldb <crawldb>] [-linkdb <linkdb>] [-params
                   k1=v1&k2=v2...] (<segment> ... | -dir <segments>)
                   [-noCommit] [-deleteGone] [-filter] [-normalize]
                   [-addBinaryContent] [-base64]
 -abc,--addBinaryContent   add the raw content of the document to the
                           indexing job (optional)
 -b,--base64               if raw content is added, base64 encode it
                           (optional)
 -c,--crawldb <arg>        a crawldb directory to use with this tool
                           (optional)
 -dg,--deleteGone          delete gone documents e.g. documents which no
                           longer exist at the particular resource
                           (optional)
 -f,--filter               filter documents (optional)
 -l,--linkdb <arg>         a linkdb directory to use with this tool
                           (optional)
 -n,--normalize            normalize documents (optional)
 -nc,--noCommit            do the commits once and for all the reducers in
                           one go (optional)
 -p,--params <arg>         key value parameters to be used with this tool
                           e.g. k1=v1&k2=v2... (optional)
 -s,--segment <arg>        a single segment directory to be used with this
                           tool (either this or -segmentDir is mandatory)
 -sd,--segmentDir <arg>    a directory containing one or more segments to
                           be used with this tool (either this or -segment
                           is mandatory)
Active IndexWriters :
SolrIndexWriter
        solr.server.type : Type of SolrServer to communicate with (default 'http' however options include 'cloud', 'lb' and 'concurrent')
        solr.server.url : URL of the Solr instance (mandatory)
        solr.zookeeper.url : URL of the Zookeeper URL (mandatory if 'cloud' value for solr.server.type)
        solr.loadbalance.urls : Comma-separated string of Solr server strings to be used (madatory if 'lb' value for solr.server.type)
        solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
        solr.commit.size : buffer size when sending to Solr (default 1000)
        solr.auth : use authentication (default false)
        solr.auth.username : username for authentication
        solr.auth.password : password for authentication
{code}
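
The argument rule described by the help text above (crawldb and linkdb optional, at least one of -s/--segment or -sd/--segmentDir required) can be sketched roughly as follows. This is only an illustration of the rule: the class and method names are hypothetical, the segment paths are made up, and it is not the actual IndexingJob parsing code.

```java
// Hypothetical sketch of the rule stated in the help text; NOT the real IndexingJob code.
class IndexingArgsSketch {

  /** Returns null when the arguments are acceptable, otherwise an error message. */
  static String validate(String[] args) {
    if (args.length == 0) {
      // Mirrors the message shown in the help output above.
      return "Did not see expected # of arguments, saw 0";
    }
    String segment = null;
    String segmentDir = null;
    for (int i = 0; i < args.length; i++) {
      String a = args[i];
      boolean isSegment = a.equals("-s") || a.equals("--segment");
      boolean isSegmentDir = a.equals("-sd") || a.equals("--segmentDir");
      // Options listed with <arg> in the help text consume the next token.
      boolean takesValue = isSegment || isSegmentDir
          || a.equals("-c") || a.equals("--crawldb")
          || a.equals("-l") || a.equals("--linkdb")
          || a.equals("-p") || a.equals("--params");
      if (!takesValue) {
        continue; // boolean flags such as -f or -n carry no value
      }
      if (i + 1 >= args.length) {
        return "Missing value for " + a;
      }
      String value = args[++i];
      if (isSegment) {
        segment = value;
      }
      if (isSegmentDir) {
        segmentDir = value;
      }
    }
    // crawldb and linkdb are optional; only segment input is required.
    if (segment == null && segmentDir == null) {
      return "Either -s/--segment or -sd/--segmentDir is required";
    }
    return null;
  }

  public static void main(String[] args) {
    // Hypothetical paths, used only to exercise the check.
    System.out.println(validate(new String[] {}));
    System.out.println(validate(new String[] {"-s", "crawl/segments/example"}));
    System.out.println(validate(new String[] {"-c", "crawl/crawldb"}));
  }
}
```

The sketch deliberately does not reject the case where both -s and -sd are supplied, since the help text only says that one of them is mandatory.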

> Enable IndexingJob to function with no crawldb
> ----------------------------------------------
>
>                 Key: NUTCH-2184
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2184
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 1.12
>
>
> Sometimes when working with distributed team(s), we have found that we can 
> lose data structures which are currently considered critical, e.g. 
> crawldb, linkdb and/or segments.
> In my current scenario I have a requirement to index segment data with no 
> accompanying crawldb or linkdb. 
> Absence of the latter is OK as linkdb is optional; however, crawldb is 
> currently mandatory in 
> [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java].
> This ticket should enhance the IndexerMapReduce code to support the use case 
> where you ONLY have segments and want to force an index for every record 
> present.
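
The behaviour requested in the description above might be pictured with a small decision function like the one below. It is a sketch under stated assumptions: the class, the field names and the idea that a record is skipped when a crawldb is supplied but holds no entry for it are all illustrative, and none of it is taken from the IndexerMapReduce source.

```java
// Hypothetical sketch; names and rules are illustrative, not IndexerMapReduce code.
class SegmentOnlyIndexingSketch {

  /** Data collected for one URL; a null field means that piece is absent. */
  static final class Collected {
    final String crawlDbEntry; // entry from the crawldb, if one was supplied
    final String parsedText;   // parsed content from the segment

    Collected(String crawlDbEntry, String parsedText) {
      this.crawlDbEntry = crawlDbEntry;
      this.parsedText = parsedText;
    }
  }

  /** Decides whether a record should be sent to the index. */
  static boolean shouldIndex(Collected c, boolean crawlDbSupplied) {
    if (c.parsedText == null) {
      return false; // no segment data, so there is nothing to index
    }
    if (crawlDbSupplied && c.crawlDbEntry == null) {
      return false; // assumed rule: a supplied crawldb must know the record
    }
    return true; // segments only: index every record present
  }

  public static void main(String[] args) {
    Collected segmentOnly = new Collected(null, "page text");
    System.out.println(shouldIndex(segmentOnly, false));
    System.out.println(shouldIndex(segmentOnly, true));
  }
}
```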



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
