Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "IndexReplace" page has been changed by PeterCiuffetti: https://wiki.apache.org/nutch/IndexReplace New page: = Index Replace = The '''index-replace''' plugin is an indexing filter that allows regexp replace manipulation of metadata fields. The use cases would include adjusting the Nutch document field set and structure to conform to a field set used by a target core that was different than the default fieldset used by Nutch. With this plugin you can modify the structure of existing fields and copy modified fields into a new fields. It allows these replacements to be done globally for all parsed pages and for modifications to be done only for certain host or URL patterns. Related plugins include [[https://issues.apache.org/jira/browse/NUTCH-940|index-static]] which allows you to add one or more fields with static values. Also the `indexer-solr` plugin has a config file `solrindex-mapping.xml` which allows you to rename and copy fields. The '''index-replace''' plugin allows you to make modifications to the fields. == Configuration Example == In `conf/nutch-site.xml` add something like: {{{ <property> <name>index.replace.regexp</name> <value> id=/file\:/http\:my.site.com/ url=/file\:/http\:my.site.com/2 </value> </property> }}} Also insure that `index-replace` is among the plugins that will be used. {{{ <property> <name>plugin.includes</name> <value>...|index-(basic|anchor|metadata|static|replace)|...</value> </property> }}} == Property format == Name: `index.replace.regexp` The format of the property is a list of regexp replacements, one line per field being modified. Field names would be one of those from [[IndexStructure|IndexStructure]]. The field name precedes the equal sign. The first character after the equal sign signifies the delimiter for the regexp, the replacement value and the optional flags. == Replacement Sequence == The replacements will happen in the order listed. If a field needs multiple replacement operations it may be listed more than once. == RegExp Format == The regexp and the optional flags should correspond to Java's [[http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#compile%28java.lang.String,%20int%29|Pattern.compile]]. Patterns are compiled when the plugin is initialized for efficiency. == Replacement Format == The replacement value should correspond to Java [[http://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html#replaceAll%28java.lang.String%29|Matcher]] == Flags == The flags is an integer sum of the flag values defined in Java [[http://docs.oracle.com/javase/7/docs/api/constant-values.html|constant values]] (Sec: java.util.regex.Pattern) == Creating New Fields == If you express the fieldname as `fldname1:fldname2=[replacement]`, then the replacer will create a new field (fldname2) from the source field (fldname1). The source field remains unmodified. This is an alternative to `solrindex-mapping.xml` which is only able to copy fields verbatim. == Multi-valued Fields == If a field has multiple values, the replacement will be applied to each value in turn. == Non-string Datatypes == Replacement is possible only on `String` field datatypes. If the field you name in the property is not a `String` datatype, it will be silently ignored. == Host and URL specific replacements == If the replacements should apply only to specific pages, then add a sequence like {{{ hostmatch=hostmatchpattern fld1=/regexp/replace/flags fld2=/regexp/replace/flags }}} or {{{ urlmatch=urlmatchpattern fld1=/regexp/replace/flags fld2=/regexp/replace/flags }}} When using Host and URL replacements, all replacements preceding the first `hostmatch=` or `urlmatch=` will apply to all parsed pages. Replacements following a `hostmatch` or `urlmatch` will be applied to pages which match the host or url field (up to the next `hostmatch` or `urlmatch` line). `hostmatch` and `urlmatch` patterns must be unique in this property. == Plugin order == In most cases you will want this plugin to run last among the index filters, just before you run your indexer plugin. == Testing your match patterns == [[http://www.regexplanet.com/advanced/java/index.html|Online Regexp testers]] can help get the basics of your pattern working. If your property does not parse correctly, you can discover this by looking in the `hadoop.log` after doing a trial indexing run. Its important to test your patterns because the `index-replace` plugin will mark any entry in the replacement list as ''invalid'' which does not parse into a proper regexp operation. Invalid replacement operations are simply ignored. === To test in Nutch === * Prepare a test HTML file with the field contents you want to test. * Place this in a directory accessible to nutch. * Use the file:/// syntax to list the test file(s) in a test/urls seed list. * See the nutch faq [[https://wiki.apache.org/nutch/FAQ#How_do_I_index_my_local_file_system.3F|index my local file system]] for conf settings you will need. (Note the `urlmatch=` and `hostmatch=` patterns in your configuration may not conform to your test file url; This test approach confirms only how your global matches behave, unless your `urlmatch=` and `hostmatch=` patterns also match the file: URL pattern for your test file) Run.. {{{ bin/nutch inject crawl/crawldb test bin/nutch generate crawl/crawldb crawl/segments bin/nutch fetch crawl/segments/[segment] bin/nutch parse crawl/segments/[segment] bin/nutch invertlinks crawl/linkdb -dir crawl/segments ...index your document, for example with SOLR... bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segement[segment] -filter -normalize }}} Inspect `hadoop.log` for info about pattern parsing and compilation.. {{{ grep replace logs/hadoop.log }}} To inspect your index with the solr admin panel browse to... {{{ http://localhost:8983/solr/#/ }}} And if you want to adjust your patterns in `nutch-site.xml` and re-test, you only need to repeat the solrindex step above and review the result.

