[ 
https://issues.apache.org/jira/browse/NUTCH-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14613677#comment-14613677
 ] 

ASF GitHub Bot commented on NUTCH-2058:
---------------------------------------

GitHub user PeterCiuffetti opened a pull request:

    https://github.com/apache/nutch/pull/44

    Nutch 2058 - New index-replace plugin that allows regexp field value 
replacements

    Modifies the NutchDocument during the IndexingFilter phase to do regexp 
replacements on specified fields.
    
    See https://issues.apache.org/jira/browse/NUTCH-2058

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/PeterCiuffetti/nutch NUTCH-2058

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nutch/pull/44.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #44
    
----
commit dc32ce6dd66b4e712b1e9693a4e726febbc8171e
Author: PeterCiuffetti <[email protected]>
Date:   2015-07-01T13:31:03Z

    Initial checkin got parse-replace

commit 2eebd285232bd0595bf321add1d35ae1a60e7d07
Author: PeterCiuffetti <[email protected]>
Date:   2015-07-01T13:31:11Z

    Merge branch 'trunk' of github.com:apache/nutch into parse-replace

commit a2c1851e096bfd528b722778671490d4fd610a4b
Author: PeterCiuffetti <[email protected]>
Date:   2015-07-02T14:27:19Z

    Refactored from a parse filter to an index filter

commit 57748e0de2e7fc60d349462144c3ed7703ac0957
Author: PeterCiuffetti <[email protected]>
Date:   2015-07-04T09:22:02Z

    Updated tests. Feature set complete

commit e80e7b1e59a0025a1e5ed266e06546e97b7c2770
Author: PeterCiuffetti <[email protected]>
Date:   2015-07-04T09:23:23Z

    Merge branch 'trunk' of github.com:apache/nutch into NUTCH-2058

commit 81368fe08193a365a6ca6f2179eb46e96ef0f7c5
Author: PeterCiuffetti <[email protected]>
Date:   2015-07-04T09:34:18Z

    README doc change

commit d2d534c1a9a48dd7a29147453f4c4e1fc78f11fb
Author: PeterCiuffetti <[email protected]>
Date:   2015-07-04T10:17:27Z

    Updated documentation

commit 0455d9119b694ccb9274a43dba392b76771a9da1
Author: PeterCiuffetti <[email protected]>
Date:   2015-07-04T10:23:19Z

    Undoing build.xml change

----


> Indexer plugin that allows RegEx replacements on the NutchDocument field 
> values
> -------------------------------------------------------------------------------
>
>                 Key: NUTCH-2058
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2058
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>            Reporter: Peter Ciuffetti
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> This is the description of an IndexingFilter plugin I'm developing that allows 
> regex replacements on field values before they are indexed to your search engine.
> *Plugin name*: index-replace
> *Property name*: index.replace.regexp
> *Use case example:*
> I'm indexing Nutch-created documents to a pre-existing SOLR core.  In this 
> case I need to coerce the documents into the schema and field formats 
> expected by the existing core.  The features of index-static and 
> solrindex-mapping.xml get me most of the way.  Among other things, I need to 
> generate identifiers from the web URLs.  So I need to do something like a 
> regex replace on the id provided and then (with solrindex-mapping.xml) move 
> this to the field name defined by the existing core.
> Another use case might be to refactor all URLs stored in the document so they 
> route through a redirector gateway.
> The following is from the draft description in nutch-default.xml
> *Description:*
> Allows indexing-time regexp replace manipulation of metadata fields. The 
> format of the property is a list of regexp replacements, one line per field 
> being modified.  To use this property, add index-replace to your list of 
> activated plugins.
>     
> *Example:*
> {code:xml}
> <property>
>   <name>index.replace.regexp</name>
>   <value>
>         fldname1=/regexp/replacement/flags
>         fldname2=/regexp/replacement/flags
>   </value>
> </property>
> {code}
> Field names would be one of those from 
> https://wiki.apache.org/nutch/IndexStructure. The replacements will happen in 
> the order listed. If a field needs multiple replacement operations they may 
> be listed more than once.
> The *field name* precedes the equal sign.  The first character after the 
> equal sign signifies the delimiter for the regexp, the replacement value and 
> the flags.
> The *regexp* and the optional *flags* should correspond to 
> Pattern.compile(String regexp, int flags) defined here: 
> http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#compile%28java.lang.String,%20int%29
> The *flags* value is the integer sum of the flag constants defined in 
> http://docs.oracle.com/javase/7/docs/api/constant-values.html (Sec: 
> java.util.regex.Pattern)
> Patterns are compiled when the plugin is initialized for efficiency.
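A minimal sketch of the flags arithmetic described above; it is not taken from the plugin source, and the class name FlagsExample is hypothetical. It assumes two flags are wanted, Pattern.CASE_INSENSITIVE (2) and Pattern.MULTILINE (8), so the value written in the property would be 10. The pattern is compiled once and reused, in the spirit of compiling at initialization.

```java
import java.util.regex.Pattern;

public class FlagsExample {
    // Integer sum of two distinct flag constants: 2 (CASE_INSENSITIVE) + 8 (MULTILINE).
    static final int FLAGS = Pattern.CASE_INSENSITIVE + Pattern.MULTILINE;

    // Compiled once and reused for every value, rather than per document.
    static final Pattern PATTERN = Pattern.compile("^nutch", FLAGS);

    static String replace(String input) {
        return PATTERN.matcher(input).replaceAll("X");
    }

    public static void main(String[] args) {
        System.out.println(FLAGS);
        System.out.println(replace("Nutch\nNUTCH rocks"));
    }
}
```

Summing only works because each flag is a distinct bit; adding the same constant twice would yield a different flag.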
> *Escaping*: since the regexp is read from a config file, any escaped 
> values must be double escaped.  E.g.:
> {code}
>   id=/\\s+//
> {code}
> will cause the escaped \s+ match pattern to be used.
> The *replacement* value should correspond to Java's 
> Pattern.matcher(CharSequence input).replaceAll(String replacement):  
> http://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html#replaceAll%28java.lang.String%29
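The rule format described above (field name, delimiter taken from the first character after the equal sign, then regexp, replacement and optional flags) could be parsed roughly as follows. This is a sketch under stated assumptions, not the plugin's actual code: the class RuleSketch and its methods are hypothetical, and missing flags are assumed to mean 0. The example uses '|' as the delimiter because the regexp itself contains '/'.

```java
import java.util.regex.Pattern;

public class RuleSketch {
    final String field;
    final Pattern pattern;
    final String replacement;

    RuleSketch(String field, Pattern pattern, String replacement) {
        this.field = field;
        this.pattern = pattern;
        this.replacement = replacement;
    }

    // Parses one line of the form fldname=<d>regexp<d>replacement<d>flags,
    // where <d> is whatever character follows the equal sign.
    static RuleSketch parse(String line) {
        int eq = line.indexOf('=');
        if (eq < 1 || eq + 1 >= line.trim().length()) {
            throw new IllegalArgumentException("Not a replacement rule: " + line);
        }
        String field = line.substring(0, eq).trim();
        String spec = line.substring(eq + 1).trim();
        String delim = spec.substring(0, 1);
        // Limit -1 keeps trailing empty parts, e.g. an empty replacement.
        String[] parts = spec.substring(1).split(Pattern.quote(delim), -1);
        if (parts.length < 2) {
            throw new IllegalArgumentException("Missing replacement: " + line);
        }
        // Assumption: absent or empty flags mean 0 (no flags).
        int flags = 0;
        if (parts.length > 2 && !parts[2].trim().isEmpty()) {
            flags = Integer.parseInt(parts[2].trim());
        }
        return new RuleSketch(field, Pattern.compile(parts[0], flags), parts[1]);
    }

    // The replacement string follows Matcher.replaceAll rules, so $1 is a group reference.
    String apply(String value) {
        return pattern.matcher(value).replaceAll(replacement);
    }

    public static void main(String[] args) {
        RuleSketch rule = RuleSketch.parse("id=|^https?://([^/]+)/.*$|$1|2");
        System.out.println(rule.field + " -> " + rule.apply("HTTP://Example.com/a/b"));
    }
}
```

Here the flags value 2 is Pattern.CASE_INSENSITIVE, which is why the upper-case scheme still matches and the id is reduced to the host name.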
>     
> *Multi-valued Fields*
> If a field has multiple values, the replacement will be applied to each value 
> in turn.
> *Non-string Datatypes*
> Replacement is possible only on String field datatypes.  If the field you 
> name in the property is not a String datatype, it will be silently ignored.
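One way to picture the multi-valued and non-string behaviour is the sketch below. It is illustrative only: MultiValueSketch and replaceValues are hypothetical names, and passing non-String values through unchanged is an assumption about what "silently ignored" means in practice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class MultiValueSketch {
    // Applies the replacement to each String value in turn.
    // Assumption: values of any other type are passed through untouched.
    static List<Object> replaceValues(List<Object> values, Pattern pattern, String replacement) {
        List<Object> out = new ArrayList<>(values.size());
        for (Object value : values) {
            if (value instanceof String) {
                out.add(pattern.matcher((String) value).replaceAll(replacement));
            } else {
                out.add(value);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Object> values = Arrays.<Object>asList("a b", 42, "c  d");
        System.out.println(replaceValues(values, Pattern.compile("\\s+"), "_"));
    }
}
```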
> *Host and URL specific replacements*
> If the replacements should apply only to specific pages, then add a sequence 
> like
> {code}
>     hostmatch=hostmatchpattern
>     fld1=/regexp/replace/flags
>     fld2=/regexp/replace/flags
> {code}
>     or
> {code}
>     urlmatch=urlmatchpattern
>     fld1=/regexp/replace/flags
>     fld2=/regexp/replace/flags
> {code}
> When using Host and URL replacements, all replacements preceding the first 
> hostmatch or urlmatch will apply to all Nutch documents.  Replacements 
> following a hostmatch or urlmatch will be applied to Nutch documents that 
> match the host or url field (up to the next hostmatch or urlmatch line).  
> hostmatch and urlmatch patterns must be unique in this property.
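The scoping rules above can be sketched as a single pass over the property value. This is a hedged illustration, not the plugin's implementation: ScopeSketch and rulesFor are hypothetical names, and treating a "match" as Matcher.find() (a partial match) is an assumption. The method returns the rule lines that would apply to a document with the given host and URL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class ScopeSketch {
    static List<String> rulesFor(String config, String host, String url) {
        List<String> selected = new ArrayList<>();
        // Rules before the first hostmatch/urlmatch line apply to every document.
        boolean active = true;
        for (String raw : config.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith("hostmatch=")) {
                // Starts a new scope; it stays in effect until the next hostmatch/urlmatch.
                String p = line.substring("hostmatch=".length());
                active = Pattern.compile(p).matcher(host).find();
            } else if (line.startsWith("urlmatch=")) {
                String p = line.substring("urlmatch=".length());
                active = Pattern.compile(p).matcher(url).find();
            } else if (active) {
                selected.add(line);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        String config = "id=/a/b/\n"
                + "hostmatch=example\\.com\n"
                + "title=/x/y/\n"
                + "urlmatch=/docs/\n"
                + "url=/p/q/";
        System.out.println(rulesFor(config, "www.example.com", "http://www.example.com/index.html"));
        System.out.println(rulesFor(config, "other.org", "http://other.org/docs/a"));
    }
}
```

In this sketch the id rule is selected for both documents, the title rule only for the example.com host, and the url rule only for the URL containing /docs/.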



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
