[
https://issues.apache.org/jira/browse/NUTCH-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14613677#comment-14613677
]
ASF GitHub Bot commented on NUTCH-2058:
---------------------------------------
GitHub user PeterCiuffetti opened a pull request:
https://github.com/apache/nutch/pull/44
Nutch 2058 - New index-replace plugin that allows regexp field value
replacements
Modifies the NutchDocument during the IndexingFilter phase to do regexp
replacements on specified fields.
See https://issues.apache.org/jira/browse/NUTCH-2058
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/PeterCiuffetti/nutch NUTCH-2058
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/nutch/pull/44.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #44
----
commit dc32ce6dd66b4e712b1e9693a4e726febbc8171e
Author: PeterCiuffetti <[email protected]>
Date: 2015-07-01T13:31:03Z
Initial checkin got parse-replace
commit 2eebd285232bd0595bf321add1d35ae1a60e7d07
Author: PeterCiuffetti <[email protected]>
Date: 2015-07-01T13:31:11Z
Merge branch 'trunk' of github.com:apache/nutch into parse-replace
commit a2c1851e096bfd528b722778671490d4fd610a4b
Author: PeterCiuffetti <[email protected]>
Date: 2015-07-02T14:27:19Z
Refactored from a parse filter to an index filter
commit 57748e0de2e7fc60d349462144c3ed7703ac0957
Author: PeterCiuffetti <[email protected]>
Date: 2015-07-04T09:22:02Z
Updated tests. Feature set complete
commit e80e7b1e59a0025a1e5ed266e06546e97b7c2770
Author: PeterCiuffetti <[email protected]>
Date: 2015-07-04T09:23:23Z
Merge branch 'trunk' of github.com:apache/nutch into NUTCH-2058
commit 81368fe08193a365a6ca6f2179eb46e96ef0f7c5
Author: PeterCiuffetti <[email protected]>
Date: 2015-07-04T09:34:18Z
README doc change
commit d2d534c1a9a48dd7a29147453f4c4e1fc78f11fb
Author: PeterCiuffetti <[email protected]>
Date: 2015-07-04T10:17:27Z
Updated documentation
commit 0455d9119b694ccb9274a43dba392b76771a9da1
Author: PeterCiuffetti <[email protected]>
Date: 2015-07-04T10:23:19Z
Undoing build.xml change
----
> Indexer plugin that allows RegEx replacements on the NutchDocument field
> values
> -------------------------------------------------------------------------------
>
> Key: NUTCH-2058
> URL: https://issues.apache.org/jira/browse/NUTCH-2058
> Project: Nutch
> Issue Type: Improvement
> Components: indexer
> Reporter: Peter Ciuffetti
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> This is the description of an IndexingFilter plugin I'm developing that allows
> regex replacements on field values prior to indexing to your search engine.
> *Plugin name*: index-replace
> *Property name*: index.replace.regexp
> *Use case example:*
> I'm indexing Nutch-created documents to a pre-existing SOLR core. In this
> case I need to coerce the documents into the schema and field formats
> expected by the existing core. The features of index-static and
> solrindex-mapping.xml get me most of the way. Among other things, I need to
> generate identifiers from the web URLs. So I need to do something like a
> regex replace on the id provided and then (with solrindex-mapping.xml) move
> this to the field name defined by the existing core.
> Another use case might be to refactor all URLs stored in the document so they
> route through a redirector gateway.
> The following is from the draft description in nutch-default.xml
> *Description:*
> Allows indexing-time regexp replace manipulation of metadata fields. The
> format of the property is a list of regexp replacements, one line per field
> being modified. To use this property, add index-replace to your list of
> activated plugins.
>
> *Example:*
> {code:xml}
> <property>
> <name>index.replace.regexp</name>
> <value>
> fldname1=/regexp/replacement/flags
> fldname2=/regexp/replacement/flags
> </value>
> </property>
> {code}
> Field names would be one of those from
> https://wiki.apache.org/nutch/IndexStructure. The replacements will happen in
> the order listed. If a field needs multiple replacement operations, it may
> be listed more than once.
> The *field name* precedes the equal sign. The first character after the
> equal sign signifies the delimiter for the regexp, the replacement value and
> the flags.
> The *regexp* and the optional *flags* should correspond to
> Pattern.compile(String regexp, int flags) defined here:
> http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#compile%28java.lang.String,%20int%29
> The *flags* value is the integer sum of the flag values defined in
> http://docs.oracle.com/javase/7/docs/api/constant-values.html (Sec:
> java.util.regex.Pattern)
> For efficiency, patterns are compiled once, when the plugin is initialized.
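As a rough illustration of the flags value described above, the following sketch sums two java.util.regex.Pattern constants and compiles a pattern once with the result. The class name FlagSumSketch and its compile helper are hypothetical and are not taken from the pull request.

```java
import java.util.regex.Pattern;

public class FlagSumSketch {
    // Hypothetical helper: compile a rule's regexp with the integer flags
    // value as it would appear at the end of a rule line.
    static Pattern compile(String regexp, int flags) {
        return Pattern.compile(regexp, flags);
    }

    public static void main(String[] args) {
        // CASE_INSENSITIVE is 2 and MULTILINE is 8, so the rule would carry 10.
        int flags = Pattern.CASE_INSENSITIVE + Pattern.MULTILINE;
        Pattern p = compile("^nutch", flags);
        System.out.println(flags);
        // With MULTILINE, ^ also matches after the line break.
        System.out.println(p.matcher("solr\nNUTCH").find());
    }
}
```

Under this reading, a rule that needs case-insensitive, multi-line matching would end in the flags value 10.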
> *Escaping*: since the regexp is being read from a config file, any escaped
> values must be double escaped. E.g.: {code}
> id=/\\s+//
> {code} will cause the escaped \s+ match pattern to be used.
> The *replacement* value should follow the rules of Java's
> Pattern.matcher(CharSequence input).replaceAll(String replacement):
> http://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html#replaceAll%28java.lang.String%29
>
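The rule syntax described above (field name, equal sign, delimiter, regexp, replacement, optional flags) could be parsed roughly as follows. This is a minimal sketch, not the plugin's code: the class name ReplaceRuleSketch is hypothetical, it assumes the delimiter never occurs inside the regexp or the replacement, and it does not model the config-file double escaping.

```java
import java.util.regex.Pattern;

public class ReplaceRuleSketch {
    final String field;
    final Pattern pattern;
    final String replacement;

    ReplaceRuleSketch(String field, Pattern pattern, String replacement) {
        this.field = field;
        this.pattern = pattern;
        this.replacement = replacement;
    }

    // Parses one rule line of the form fldname=/regexp/replacement/flags.
    static ReplaceRuleSketch parse(String line) {
        int eq = line.indexOf('=');
        if (eq < 1 || eq + 1 >= line.length()) {
            throw new IllegalArgumentException("Malformed rule: " + line);
        }
        String field = line.substring(0, eq).trim();
        // The first character after the equal sign is the delimiter.
        String delim = String.valueOf(line.charAt(eq + 1));
        // Limit -1 keeps trailing empty strings (empty replacement or flags).
        String[] parts = line.substring(eq + 2).split(Pattern.quote(delim), -1);
        if (parts.length < 2) {
            throw new IllegalArgumentException("Malformed rule: " + line);
        }
        // Missing or empty flags are treated as 0 (an assumption of this sketch).
        int flags = (parts.length > 2 && !parts[2].trim().isEmpty())
                ? Integer.parseInt(parts[2].trim())
                : 0;
        return new ReplaceRuleSketch(field, Pattern.compile(parts[0], flags), parts[1]);
    }

    // Applies the compiled pattern to a single field value.
    String apply(String value) {
        return pattern.matcher(value).replaceAll(replacement);
    }

    public static void main(String[] args) {
        // Java string literal; the rule text seen by parse() is id=/\s+//
        ReplaceRuleSketch strip = parse("id=/\\s+//");
        System.out.println(strip.apply("a b  c"));          // abc
        // Flags value 2 is Pattern.CASE_INSENSITIVE.
        ReplaceRuleSketch fixCase = parse("title=/nutch/Nutch/2");
        System.out.println(fixCase.apply("NUTCH and nutch")); // Nutch and Nutch
    }
}
```

For a multi-valued field, the same apply call would simply be run on each value in turn.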
> *Multi-valued Fields*
> If a field has multiple values, the replacement will be applied to each value
> in turn.
> *Non-string Datatypes*
> Replacement is possible only on String field datatypes. If the field you
> name in the property is not a String datatype, it will be silently ignored.
> *Host and URL specific replacements*
> If the replacements should apply only to specific pages, then add a sequence
> like
> {code}
> hostmatch=hostmatchpattern
> fld1=/regexp/replace/flags
> fld2=/regexp/replace/flags
> {code}
> or
> {code}
> urlmatch=urlmatchpattern
> fld1=/regexp/replace/flags
> fld2=/regexp/replace/flags
> {code}
> When using Host and URL replacements, all replacements preceding the first
> hostmatch or urlmatch will apply to all Nutch documents. Replacements
> following a hostmatch or urlmatch will be applied to Nutch documents that
> match the host or url field (up to the next hostmatch or urlmatch line).
> hostmatch and urlmatch patterns must be unique in this property.
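The scoping rules quoted above can be pictured as grouping the property's lines into sections: one global section, then one per hostmatch or urlmatch line. The sketch below is only an illustration under assumptions: the class name ScopedRulesSketch and the "*" key for the global section are hypothetical, and rejecting a repeated scope line with an exception is a guess at how the uniqueness requirement might be enforced.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ScopedRulesSketch {
    // Hypothetical key for rules that precede the first hostmatch/urlmatch line.
    static final String GLOBAL = "*";

    // Groups rule lines by the scope line that precedes them, in listed order.
    static Map<String, List<String>> group(String propertyValue) {
        Map<String, List<String>> sections = new LinkedHashMap<>();
        String current = GLOBAL;
        sections.put(current, new ArrayList<>());
        for (String raw : propertyValue.split("\\r?\\n")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith("hostmatch=") || line.startsWith("urlmatch=")) {
                // Assumption: a repeated scope line is treated as an error.
                if (sections.containsKey(line)) {
                    throw new IllegalArgumentException("Duplicate scope: " + line);
                }
                current = line;
                sections.put(current, new ArrayList<>());
            } else {
                // A replacement rule belongs to the most recent scope.
                sections.get(current).add(line);
            }
        }
        return sections;
    }

    public static void main(String[] args) {
        String value = String.join("\n",
                "id=/a/b/",
                "hostmatch=example",
                "title=/x/y/",
                "urlmatch=docs",
                "content=/p/q/");
        // Print each scope followed by the rules that fall under it.
        group(value).forEach((scope, rules) -> System.out.println(scope + " -> " + rules));
    }
}
```

At indexing time, the global rules would be applied to every document, and each scoped section only when the document's host or url field matches that section's pattern.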
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)