[ https://issues.apache.org/jira/browse/NUTCH-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16654773#comment-16654773 ]

ASF GitHub Bot commented on NUTCH-2658:
---------------------------------------

sebastian-nagel commented on a change in pull request #398: NUTCH-2658 Add README for the index-links plugin
URL: https://github.com/apache/nutch/pull/398#discussion_r226201417
 
 

 ##########
 File path: src/plugin/index-links/README.md
 ##########
 @@ -0,0 +1,53 @@
+indexer-links plugin for Nutch
+==============================
+
+This plugin provides the feature to index the inlinks and outlinks of a URL
+into an indexing backend.
+
+## Configuration
+
+This plugin provides the following configuration options:
+
+* `index.links.outlinks.host.ignore`: If true, the plugin will ignore outlinks
+that point to the same host as the current URL. By default, all outlinks are
+indexed. If `db.ignore.internal.links` is `true` (default value) this setting
+is ignored because the internal links are already ignored.
+
+* `index.links.inlinks.host.ignore`: If true, the plugin will ignore inlinks
+coming from the same host as the current URL. By default, all inlinks are
+indexed. If `db.ignore.internal.links` is `true` (default value) this setting
+is ignored because the internal links are already ignored.
+
+* `index.links.hosts.only`: If true, the plugin will index only the host portion of the inlinks/outlinks URLs.
+
+## Fields
+
+For this plugin to work 2 new fields have to be added/configured in your storage backend:
+
+* `inlinks`
+* `outlinks`
+
+If the plugin is enabled these fields have to be added to your storage backend
+configuration.
+
+The specifics of how these fields are configured depends on your specific
+backend. We provide here sane default values for Solr.
+
+The following fields should be added to your backend storage. We provide
+examples of default values for the Solr schema.
+
+* Each outlink/inlink will be stored as a string without any tokenization.
+* The `inlinks`/`outlinks` fields have to be multivalued, because normally a
+given URL will have multiple inlinks and outlinks.
+
+```
+<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
+```
+
+The field configuration could look like:
+
+```
+<field name="inlinks" type="multiValuedString" stored="true" indexed="true" multiValued="true"/>
 
 Review comment:
   The Solr schema 
([conf/schema.xml](/apache/nutch/blob/master/conf/schema.xml)) already contains 
the field definitions for multiple IndexingFilter plugins. Why not also add the 
`inlinks` and `outlinks` fields to the schema?
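
   As a rough sketch only, the entries in `conf/schema.xml` might look like the 
following. This assumes the schema already defines a `string` field type backed 
by `solr.StrField`; the attribute values are illustrative and are not taken from 
the committed schema.

   ```xml
   <!-- Sketch: fields for the index-links plugin (assumed values, not the committed schema) -->
   <!-- Each link is stored as an untokenized string; multiValued because a URL usually has many links -->
   <field name="inlinks" type="string" stored="true" indexed="true" multiValued="true"/>
   <field name="outlinks" type="string" stored="true" indexed="true" multiValued="true"/>
   ```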

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add README file to all plugins in src/plugin
> --------------------------------------------
>
>                 Key: NUTCH-2658
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2658
>             Project: Nutch
>          Issue Type: Improvement
>          Components: documentation, plugin
>            Reporter: Jorge Luis Betancourt Gonzalez
>            Priority: Trivial
>
> Since we've migrated a good portion of our workflow to GitHub, we could 
> consider adding a {{README.md}} file to the root of each plugin in 
> {{src/plugin}}. 
> This is a good place to have plugin-specific documentation: which fields the 
> plugin adds to the indexer, which configuration options it has, etc. Also, since the 
> README.md is rendered by GitHub automatically, it is a good link to point users to.
> I think that a good example is the {{indexer-cloudsearch}} plugin. On top of 
> that, it's a good source of information to point users to when they ask questions 
> regarding a specific plugin.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
