Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "IndexWriters" page has been changed by RoannelFernandez:
https://wiki.apache.org/nutch/IndexWriters?action=diff&rev1=6&rev2=7

Comment:
Description for each section

- = Index writers configuration =
- <<TableOfContents(4)>>
+ = Index writers in Nutch =
+
+ An index writer is a component of the indexing job that sends documents from one or more segments to an external server. In Nutch, index writers are implemented as plugins. Nutch includes these out-of-the-box indexers:
+
+ ||'''Indexer''' ||'''Description''' ||
+ ||indexer-solr ||Indexer for a Solr server ||
+ ||indexer-rabbit ||Indexer for a RabbitMQ server ||
+ ||indexer-dummy ||Indexer typically used for debugging; it writes documents to a plain text file ||
+ ||indexer-elastic ||Indexer for an Elasticsearch server ||
+ ||indexer-elastic-rest ||Indexer for Elasticsearch that uses [[https://github.com/searchbox-io/Jest|Jest]] to connect to the REST API provided by Elasticsearch ||
+ ||indexer-cloudsearch ||Indexer for Amazon <<GetText(CloudSearch)>> ||
+
- == Structure of index-writers.xml ==
+ = Structure of index-writers.xml =
+
+ The indexers are configured in the index-writers.xml file, which is included in the official Nutch distribution. The file consists mainly of a list of indexers (`<writer>` elements):
+
+ {{{#!highlight xml
+ <writers>
+   <writer id="<writer_id>" class="<implementation_class>">
+     <mapping>
+       ...
+     </mapping>
+     <parameters>
+       ...
+     </parameters>
+   </writer>
+   ...
+ </writers>
+ }}}
+
+ Each `<writer>` element has two mandatory attributes:
+
+  1. `<writer_id>` is a unique identifier for each configuration. It allows Nutch to distinguish configurations from one another, even when they are for the same index writer. It also allows multiple instances of the same index writer, each with a different configuration.
+  1. `<implementation_class>` is the canonical name of the class that implements the [[https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/indexer/IndexWriter.html|IndexWriter]] extension point. For the indexers provided by Nutch out-of-the-box, the possible values of `<implementation_class>` are:
+
+ ||'''Indexer''' ||'''Implementation class''' ||
+ ||indexer-solr ||[[https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/indexwriter/solr/SolrIndexWriter.html|org.apache.nutch.indexwriter.solr.SolrIndexWriter]] ||
+ ||indexer-rabbit ||[[https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/indexwriter/rabbit/RabbitIndexWriter.html|org.apache.nutch.indexwriter.rabbit.RabbitIndexWriter]] ||
+ ||indexer-dummy ||[[https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/indexwriter/dummy/DummyIndexWriter.html|org.apache.nutch.indexwriter.dummy.DummyIndexWriter]] ||
+ ||indexer-elastic ||[[https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.html|org.apache.nutch.indexwriter.elastic.ElasticIndexWriter]] ||
+ ||indexer-elastic-rest ||[[https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/indexwriter/elasticrest/ElasticRestIndexWriter.html|org.apache.nutch.indexwriter.elasticrest.ElasticRestIndexWriter]] ||
+ ||indexer-cloudsearch ||[[https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/indexwriter/cloudsearch/CloudSearchIndexWriter.html|org.apache.nutch.indexwriter.cloudsearch.CloudSearchIndexWriter]] ||
+
+ Each `<writer>` element contains two child elements: `<mapping>` and `<parameters>`.
  == Mapping section ==
+ The `<mapping>` element is independent for each configuration. It is where you configure the modifications applied to each document before it is sent to its final destination. The `<mapping>` element contains three child elements: `<copy>`, `<rename>` and `<remove>`.
+
+  * `<copy>` indicates which fields of the document should be copied and to which fields. Each child element of `<copy>` has this form: `<field source="<source>" dest="<destination>"/>`
+   * `<source>` is the name of the field to be copied.
+   * `<destination>` is the field or fields to copy into. The value of this attribute can be a comma-separated list, in which case the value of the '''source''' field is copied into each field in the list. For example, if the configuration is `<field source="title" dest="description,search"/>`, the value of the '''title''' field will be copied to the '''description''' and '''search''' fields.
+  * `<rename>` indicates which fields of the document should be renamed. Each child element of `<rename>` has this form: `<field source="<source>" dest="<destination>"/>`
+   * `<source>` is the name of the field to be renamed.
+   * `<destination>` is the new name of the field. For example, if the configuration is `<field source="metatag.description" dest="description"/>`, the field '''metatag.description''' will be renamed to '''description'''.
+  * `<remove>` indicates which fields of the document should be removed. Each child element of `<remove>` has this form: `<field source="<source>"/>`
+   * `<source>` is the name of the field to be removed.
+ == Parameters section ==
+
+ The `<parameters>` element is independent for each configuration. It is where the parameters that the indexer needs are specified. Each parameter has the form `<param name="<name>" value="<value>"/>`, and the values it can take depend on the indexer being configured. The parameters of each indexer provided by Nutch out-of-the-box are described below.
  === Solr indexer properties ===
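+
+ As a rough illustration of how the mapping and parameters sections fit together, the sketch below shows one possible `<writer>` entry for the Solr indexer. The writer id, the `url` and `commitSize` parameter names and their values, and the renamed and removed fields are assumptions chosen for illustration, not values confirmed by this page. Only the `title` copy example and the implementation class come from the descriptions above.
+
+ {{{#!highlight xml
+ <writers>
+   <!-- "indexer_solr_1" is a hypothetical id; any unique value should work -->
+   <writer id="indexer_solr_1" class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
+     <mapping>
+       <copy>
+         <!-- copy the value of "title" into both "description" and "search" -->
+         <field source="title" dest="description,search"/>
+       </copy>
+       <rename>
+         <!-- illustrative field names: rename "metatag.keywords" to "keywords" -->
+         <field source="metatag.keywords" dest="keywords"/>
+       </rename>
+       <remove>
+         <!-- illustrative field name: drop "segment" before sending -->
+         <field source="segment"/>
+       </remove>
+     </mapping>
+     <parameters>
+       <!-- assumed parameter names and values; check the indexer's own documentation -->
+       <param name="url" value="http://localhost:8983/solr/nutch"/>
+       <param name="commitSize" value="1000"/>
+     </parameters>
+   </writer>
+ </writers>
+ }}}
+
+ A second `<writer>` element with a different id but the same class could sit alongside this one, for instance to send the same documents to another Solr server with a different mapping.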

