Peter Ciuffetti created NUTCH-2058:
--------------------------------------

             Summary: Indexer plugin that allows RegEx replacements on the 
NutchDocument field values
                 Key: NUTCH-2058
                 URL: https://issues.apache.org/jira/browse/NUTCH-2058
             Project: Nutch
          Issue Type: Improvement
          Components: indexer
            Reporter: Peter Ciuffetti


This is the description of an IndexingFilter plugin I'm developing that allows 
regex replacements on field values before they are indexed to your search engine.

*Plugin name*: index-replace

*Property name*: index.replace.regexp

*Use case example:*

I'm indexing Nutch-created documents to a pre-existing Solr core.  In this case 
I need to coerce the documents into the schema and field formats expected by 
the existing core.  The features of index-static and solrindex-mapping.xml get 
me most of the way.  Among other things, I need to generate identifiers from 
the web URLs, so I need to do something like a regex replace on the id 
provided and then (with solrindex-mapping.xml) move the result to the field 
name defined by the existing core.

Another use case might be to refactor all URLs stored in the document so they 
route through a redirector gateway.

The following is from the draft description in nutch-default.xml:

*Description:*
Allows indexing-time regexp replace manipulation of metadata fields. The format 
of the property is a list of regexp replacements, one line per field being 
modified.  To use this property, add index-replace to your list of activated 
plugins.
    
*Example:*
{code:xml}
<property>
  <name>index.replace.regexp</name>
  <value>
        fldname1=/regexp/replacement/flags
        fldname2=/regexp/replacement/flags
  </value>
</property>
{code}

Each field name should be one of those listed at 
https://wiki.apache.org/nutch/IndexStructure. The replacements happen in 
the order listed. If a field needs multiple replacement operations, it may be 
listed more than once.

The *field name* precedes the equal sign.  The first character after the equal 
sign signifies the delimiter for the regexp, the replacement value and the 
flags.
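
As a rough illustration of the line format just described, the sketch below 
splits one rule line into its field name, regexp, replacement and flags.  The 
class name ReplaceRule and its methods are hypothetical and not taken from the 
plugin.  It assumes the delimiter never appears inside the regexp or the 
replacement, and that missing or empty flags mean 0.

```java
import java.util.regex.Pattern;

/** Hypothetical holder for one parsed rule; not the plugin's actual class. */
final class ReplaceRule {
    final String field;
    final Pattern pattern;
    final String replacement;

    private ReplaceRule(String field, Pattern pattern, String replacement) {
        this.field = field;
        this.pattern = pattern;
        this.replacement = replacement;
    }

    /** Parses "field=DregexpDreplacementDflags", where D is the character right after '='. */
    static ReplaceRule parse(String line) {
        String trimmed = line.trim();
        int eq = trimmed.indexOf('=');
        if (eq <= 0 || eq + 1 >= trimmed.length()) {
            throw new IllegalArgumentException("Not a replacement rule: " + line);
        }
        String field = trimmed.substring(0, eq).trim();
        String delimiter = String.valueOf(trimmed.charAt(eq + 1));
        // A limit of -1 keeps trailing empty parts, so an empty replacement or flags part survives.
        String[] parts = trimmed.substring(eq + 2).split(Pattern.quote(delimiter), -1);
        if (parts.length < 2) {
            throw new IllegalArgumentException("Missing replacement part: " + line);
        }
        // Assumption: absent or empty flags mean 0 (no flags).
        int flags = (parts.length > 2 && !parts[2].trim().isEmpty())
                ? Integer.parseInt(parts[2].trim())
                : 0;
        return new ReplaceRule(field, Pattern.compile(parts[0], flags), parts[1]);
    }

    /** Applies the rule to a single field value. */
    String apply(String value) {
        return pattern.matcher(value).replaceAll(replacement);
    }

    public static void main(String[] args) {
        // In Java source the backslash is doubled; the rule text itself is id=/\s+//
        ReplaceRule rule = ReplaceRule.parse("id=/\\s+//");
        System.out.println(rule.field + " -> " + rule.apply("a b  c"));
    }
}
```

Because the delimiter is read from the line itself, a rule whose regexp 
contains slashes could use another character, for example title=#foo#bar#2.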

The *regexp* and the optional *flags* should correspond to 
Pattern.compile(String regexp, int flags) defined here: 
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#compile%28java.lang.String,%20int%29

The *flags* value is the integer sum of the flag values defined in 
http://docs.oracle.com/javase/7/docs/api/constant-values.html (Sec: 
java.util.regex.Pattern)
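
To make the flag sum concrete: Pattern.CASE_INSENSITIVE is 2 and 
Pattern.DOTALL is 32, so a rule that needs both would carry the flags value 34.  
The small sketch below (the class name FlagSum is only illustrative) shows the 
effect of passing that sum to Pattern.compile.

```java
import java.util.regex.Pattern;

final class FlagSum {
    /** CASE_INSENSITIVE (2) + DOTALL (32) = 34. */
    static final int FLAGS = Pattern.CASE_INSENSITIVE + Pattern.DOTALL;

    /** Returns true if the whole input matches the regexp under the given flags. */
    static boolean matches(String regexp, int flags, String input) {
        return Pattern.compile(regexp, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(FLAGS);                          // 34
        System.out.println(matches("a.b", FLAGS, "A\nB"));  // true
        System.out.println(matches("a.b", 0, "A\nB"));      // false
    }
}
```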

For efficiency, patterns are compiled once, when the plugin is initialized.

*Escaping*: since the regexp is read from a config file, any escaped 
values must be double escaped.  For example,
{code}
  id=/\\s+//
{code}
causes the escaped \s+ match pattern to be used.

The *replacement* value should follow the rules of Java's 
Pattern.matcher(CharSequence input).replaceAll(String replacement):  
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html#replaceAll%28java.lang.String%29
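
Since the replacement follows replaceAll semantics, capture groups can be 
referenced as $1, $2, and so on, which suits the identifier use case above.  
The pattern and the host:path id format in this sketch are made up for 
illustration only; they are not part of the proposal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IdFromUrl {
    // Hypothetical rule: keep the host and the path, drop the scheme.
    static final Pattern URL = Pattern.compile("^https?://([^/]+)/(.*)$");

    /** Rewrites a URL into "host:path"; input that does not match is returned unchanged. */
    static String toId(String url) {
        return URL.matcher(url).replaceAll("$1:$2");
    }

    /** Replaces every match with literal text, even if that text contains '$' or '\'. */
    static String replaceLiteral(String input, String regexp, String literal) {
        return Pattern.compile(regexp).matcher(input)
                .replaceAll(Matcher.quoteReplacement(literal));
    }

    public static void main(String[] args) {
        System.out.println(toId("http://example.com/a/b"));
        System.out.println(replaceLiteral("price", "price", "$5"));
    }
}
```

Note that $ and \ are special in a replacement string, so a literal dollar sign 
has to be escaped (Matcher.quoteReplacement does this in code).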
    
*Multi-valued Fields*
If a field has multiple values, the replacement will be applied to each value 
in turn.

*Non-string Datatypes*
Replacement is possible only on String field datatypes.  If the field you name 
in the property is not a String datatype, it will be silently ignored.
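
The multi-value and non-string behaviour can be sketched as follows.  This 
uses a plain List of Object in place of the real NutchDocument field API, and 
it checks the type per value, which is an assumption; the plugin may decide 
per field instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class MultiValueReplace {
    /** Applies the replacement to each String value in turn; other types pass through untouched. */
    static List<Object> replaceValues(List<Object> values, Pattern pattern, String replacement) {
        List<Object> out = new ArrayList<>(values.size());
        for (Object value : values) {
            if (value instanceof String) {
                out.add(pattern.matcher((String) value).replaceAll(replacement));
            } else {
                out.add(value); // silently ignored: not a String
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Object> values = Arrays.<Object>asList("a b", 42, " c ");
        System.out.println(replaceValues(values, Pattern.compile("\\s+"), ""));
    }
}
```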

*Host and URL specific replacements*
If the replacements should apply only to specific pages, then add a sequence like:

{code}
    hostmatch=hostmatchpattern
    fld1=/regexp/replace/flags
    fld2=/regexp/replace/flags
{code}
    or
{code}
    urlmatch=urlmatchpattern
    fld1=/regexp/replace/flags
    fld2=/regexp/replace/flags
{code}

When using host and URL replacements, all replacements preceding the first 
hostmatch or urlmatch line apply to all Nutch documents.  Replacements 
following a hostmatch or urlmatch line apply only to Nutch documents whose host 
or url field matches that pattern, up to the next hostmatch or urlmatch line.  
Each hostmatch or urlmatch pattern may appear only once in this property.
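
One way to picture the scoping rule is to group the property's lines into 
sections, as in the sketch below.  The class ScopedRules and the "*" key for 
the global section are hypothetical, and the sketch only groups lines; how the 
plugin actually matches a document's host or url against a pattern is not 
shown here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ScopedRules {
    /** Hypothetical key for rules that precede the first hostmatch/urlmatch line. */
    static final String GLOBAL = "*";

    /** Groups rule lines under the hostmatch=/urlmatch= line that precedes them. */
    static Map<String, List<String>> group(String propertyValue) {
        Map<String, List<String>> sections = new LinkedHashMap<>();
        String current = GLOBAL;
        sections.put(current, new ArrayList<>());
        for (String raw : propertyValue.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith("hostmatch=") || line.startsWith("urlmatch=")) {
                // Scope patterns must be unique within the property.
                if (sections.containsKey(line)) {
                    throw new IllegalArgumentException("Duplicate scope: " + line);
                }
                current = line;
                sections.put(current, new ArrayList<>());
            } else {
                sections.get(current).add(line);
            }
        }
        return sections;
    }

    public static void main(String[] args) {
        String value = "id=/a/b/\n"
                + "hostmatch=.*\\.example\\.com\n"
                + "title=/x/y/\n"
                + "urlmatch=.*/docs/.*\n"
                + "url=/p/q/\n";
        System.out.println(group(value));
    }
}
```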



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
