[
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056690#comment-15056690
]
Lewis John McGibbney commented on NUTCH-2184:
---------------------------------------------
This issue also improves command-line parsing for the IndexingJob tool, which
now prints the following help when invoked without arguments:
{code}
lmcgibbn@LMC-032857 /usr/local/trunk_new(joshua) $ ./runtime/local/bin/nutch index
Failed to parse command line Did not see expected # of arguments, saw 0
usage: IndexingJob [-crawldb <crawldb>] [-linkdb <linkdb>] [-params
       k1=v1&k2=v2...] (<segment> ... | -dir <segments>)
       [-noCommit] [-deleteGone] [-filter] [-normalize]
       [-addBinaryContent] [-base64]
 -abc,--addBinaryContent   add the raw content of the document to the
                           indexing job (optional)
 -b,--base64               if raw content is added, base64 encode it
                           (optional)
 -c,--crawldb <arg>        a crawldb directory to use with this tool
                           (optional)
 -dg,--deleteGone          delete gone documents e.g. documents which no
                           longer exist at the particular resource
                           (optional)
 -f,--filter               filter documents (optional)
 -l,--linkdb <arg>         a linkdb directory to use with this tool
                           (optional)
 -n,--normalize            normalize documents (optional)
 -nc,--noCommit            do the commits once and for all the reducers in
                           one go (optional)
 -p,--params <arg>         key value parameters to be used with this tool
                           e.g. k1=v1&k2=v2... (optional)
 -s,--segment <arg>        a single segment directory to be used with this
                           tool (either this or -segmentDir is mandatory)
 -sd,--segmentDir <arg>    a directory containing one or more segments to
                           be used with this tool (either this or -segment
                           is mandatory)
Active IndexWriters :
SolrIndexWriter
    solr.server.type : Type of SolrServer to communicate with (default
        'http' however options include 'cloud', 'lb' and 'concurrent')
    solr.server.url : URL of the Solr instance (mandatory)
    solr.zookeeper.url : URL of the Zookeeper URL (mandatory if 'cloud'
        value for solr.server.type)
    solr.loadbalance.urls : Comma-separated string of Solr server strings
        to be used (madatory if 'lb' value for solr.server.type)
    solr.mapping.file : name of the mapping file for fields (default
        solrindex-mapping.xml)
    solr.commit.size : buffer size when sending to Solr (default 1000)
    solr.auth : use authentication (default false)
    solr.auth.username : username for authentication
    solr.auth.password : password for authentication
{code}
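The rule stated in the help text above (crawldb and linkdb are optional, while a segment or a segment directory must be supplied) can be illustrated with a minimal sketch. This is not the IndexingJob implementation: the class name `IndexingArgsSketch`, the method `check`, and the returned messages are hypothetical, and the sketch assumes that supplying at least one of the two segment options is sufficient.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of Nutch.
public class IndexingArgsSketch {

    /**
     * Returns null when the arguments satisfy the rule from the help text,
     * otherwise a short description of the problem.
     */
    public static String check(String[] args) {
        List<String> given = Arrays.asList(args);

        // Option names are taken from the help output above.
        boolean hasSegment = given.contains("-s") || given.contains("--segment");
        boolean hasSegmentDir = given.contains("-sd") || given.contains("--segmentDir");

        // Assumption: at least one of the two segment options must be present.
        if (!hasSegment && !hasSegmentDir) {
            return "either --segment or --segmentDir is required";
        }

        // -c/--crawldb and -l/--linkdb are deliberately not checked,
        // because the help text marks both as optional.
        return null;
    }

    public static void main(String[] args) {
        String problem = check(args);
        if (problem == null) {
            System.out.println("arguments accepted");
        } else {
            System.out.println("arguments rejected: " + problem);
        }
    }
}
```

Under these assumptions, passing only `-sd` with a directory is accepted even though no crawldb is given, which is the use case this ticket targets.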
> Enable IndexingJob to function with no crawldb
> ----------------------------------------------
>
> Key: NUTCH-2184
> URL: https://issues.apache.org/jira/browse/NUTCH-2184
> Project: Nutch
> Issue Type: Improvement
> Components: indexer
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 1.12
>
>
> Sometimes when working with distributed team(s), we have found that we can
> 'lose' data structures which are currently considered critical, e.g.
> crawldb, linkdb and/or segments.
> In my current scenario I have a requirement to index segment data with no
> accompanying crawldb or linkdb.
> Absence of the latter is OK, as linkdb is optional; however, currently in
> [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java]
> crawldb is mandatory.
> This ticket should enhance the IndexerMapReduce code to support the use case
> where you ONLY have segments and want to force an index for every record
> present.