[ https://issues.apache.org/jira/browse/NUTCH-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma reopened NUTCH-2058:
----------------------------------
Reopening due to failing unit tests:
------------- ---------------- ---------------
------------- Standard Error -----------------
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/markus/projects/apache/nutch/trunk/build/test/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/markus/projects/apache/nutch/trunk/build/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
------------- ---------------- ---------------
Testcase: testGlobalAndUrlNotMatchesPattern took 1.345 sec
Testcase: testGlobalReplacement took 0.202 sec
Testcase: testReplacementsWithFlags took 0.177 sec
Testcase: testUrlMatchesPattern took 0.144 sec
Testcase: testReplacementsDifferentTarget took 0.166 sec
Testcase: testReplacementsRunInSpecifedOrder took 0.085 sec
Testcase: testInvalidPatterns took 0.128 sec
FAILED
expected:<With this []plugin, I control th...> but was:<With this [awesome ]plugin, I control th...>
junit.framework.AssertionFailedError: expected:<With this []plugin, I control th...> but was:<With this [awesome ]plugin, I control th...>
at org.apache.nutch.indexer.replace.TestIndexReplace.testInvalidPatterns(TestIndexReplace.java:203)
Testcase: testGlobalAndUrlMatchesPattern took 0.096 sec
Testcase: testUrlNotMatchesPattern took 0.074 sec
Testcase: testPropertyParse took 0.039 sec
> Indexer plugin that allows RegEx replacements on the NutchDocument field
> values
> -------------------------------------------------------------------------------
>
> Key: NUTCH-2058
> URL: https://issues.apache.org/jira/browse/NUTCH-2058
> Project: Nutch
> Issue Type: Improvement
> Components: indexer, parser
> Reporter: Peter Ciuffetti
> Assignee: Chris A. Mattmann
> Fix For: 1.11
>
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> This is the description of an IndexingFilter plugin I'm developing that allows
> regex replacements on field values prior to indexing to your search engine.
> *Plugin name*: index-replace
> *Property name*: index.replace.regexp
> *Use case example:*
> I'm indexing Nutch-created documents to a pre-existing SOLR core. In this
> case I need to coerce the documents into the schema and field formats
> expected by the existing core. The features of index-static and
> solrindex-mapping.xml get me most of the way. Among other things, I need to
> generate identifiers from the web URLs. So I need to do something like a
> regex replace on the id provided and then (with solrindex-mapping.xml) move
> this to the field name defined by the existing core.
> Another use case might be to refactor all URLs stored in the document so they
> route through a redirector gateway.
> The following is from the draft description in nutch-default.xml
> *Description:*
> Allows indexing-time regexp replace manipulation of metadata fields. The
> format of the property is a list of regexp replacements, one line per field
> being modified. To use this property, add index-replace to your list of
> activated plugins.
>
> *Example:*
> {code:xml}
> <property>
> <name>index.replace.regexp</name>
> <value>
> fldname1=/regexp/replacement/flags
> fldname2=/regexp/replacement/flags
> </value>
> </property>
> {code}
> Field names would be one of those from
> https://wiki.apache.org/nutch/IndexStructure. The replacements will happen in
> the order listed. If a field needs multiple replacement operations they may
> be listed more than once.
> The *field name* precedes the equal sign. The first character after the
> equal sign signifies the delimiter for the regexp, the replacement value and
> the flags.
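
The rule format described above can be illustrated with a minimal parsing sketch. It is not the plugin's actual code: the class `ReplaceRuleSketch`, the `Rule` holder, and the `parse` method are hypothetical names, and treating an empty flags segment as 0 is an assumption.

```java
import java.util.regex.Pattern;

public class ReplaceRuleSketch {

    /** Hypothetical holder for one parsed rule; not the plugin's own class. */
    static final class Rule {
        final String field;
        final Pattern pattern;
        final String replacement;

        Rule(String field, Pattern pattern, String replacement) {
            this.field = field;
            this.pattern = pattern;
            this.replacement = replacement;
        }
    }

    /** Parses one line of the form fldname=/regexp/replacement/flags. */
    static Rule parse(String line) {
        int eq = line.indexOf('=');
        if (eq < 1 || eq + 1 >= line.length()) {
            throw new IllegalArgumentException("Not a rule: " + line);
        }
        String field = line.substring(0, eq).trim();
        // The first character after the equal sign is the delimiter.
        String delim = line.substring(eq + 1, eq + 2);
        String[] parts = line.substring(eq + 2).split(Pattern.quote(delim), -1);
        if (parts.length < 2) {
            throw new IllegalArgumentException("Missing replacement: " + line);
        }
        // Assumption: a missing or empty flags segment means no flags (0).
        int flags = 0;
        if (parts.length > 2 && !parts[2].trim().isEmpty()) {
            flags = Integer.parseInt(parts[2].trim());
        }
        return new Rule(field, Pattern.compile(parts[0], flags), parts[1]);
    }

    public static void main(String[] args) {
        Rule r = parse("title=/awesome //");
        // Prints: With this plugin
        System.out.println(r.pattern.matcher("With this awesome plugin").replaceAll(r.replacement));
    }
}
```

Because the delimiter is read from the line itself, a rule whose regexp contains slashes can pick another character, for example `id=#https?://#urn:#2`.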
> The *regexp* and the optional *flags* should correspond to
> Pattern.compile(String regexp, int flags) defined here:
> http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#compile%28java.lang.String,%20int%29
> The *flags* is an integer sum of the flag values defined in
> http://docs.oracle.com/javase/7/docs/api/constant-values.html (Sec:
> java.util.regex.Pattern)
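
As a small illustration of the integer sum, the sketch below adds two of the `java.util.regex.Pattern` flag constants. The class name `FlagSumSketch` and the choice of flags are only for the example.

```java
import java.util.regex.Pattern;

public class FlagSumSketch {

    /** CASE_INSENSITIVE (2) + DOTALL (32) = 34, the value to write in the rule. */
    static int combined() {
        return Pattern.CASE_INSENSITIVE + Pattern.DOTALL;
    }

    public static void main(String[] args) {
        int flags = combined();
        System.out.println(flags); // prints 34
        // With both flags set, '.' matches the newline and case is ignored.
        System.out.println(Pattern.compile("a.b", flags).matcher("A\nB").matches()); // prints true
    }
}
```

A rule using these flags would then end in `/34`.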
> Patterns are compiled when the plugin is initialized for efficiency.
> *Escaping*: since the regexp is being read from a config file, any escaped
> values must be double escaped. E.g.: {code}
> id=/\\s+//
> {code} will cause the escaped \s+ match pattern to be used.
> The *replacement* value should correspond to Java's
> Pattern.matcher(CharSequence input).replaceAll(String replacement):
> http://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html#replaceAll%28java.lang.String%29
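
Since the replacement follows `Matcher.replaceAll` semantics, `$1` style group references work and a literal `$` or backslash needs escaping. A short sketch, assuming a hypothetical class `ReplacementSyntaxSketch` and an example URL pattern that is not taken from the plugin:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReplacementSyntaxSketch {

    /** Reduces a URL to its host by referencing capture group 1 in the replacement. */
    static String hostAsId(String url) {
        return Pattern.compile("^https?://([^/]+)/.*$").matcher(url).replaceAll("$1");
    }

    /** Inserts a replacement literally, even if it contains '$' or '\'. */
    static String literal(String input, String regex, String literalReplacement) {
        return Pattern.compile(regex).matcher(input)
                .replaceAll(Matcher.quoteReplacement(literalReplacement));
    }

    public static void main(String[] args) {
        System.out.println(hostAsId("http://example.org/a/b")); // prints example.org
        System.out.println(literal("price: N", "N", "$5"));      // prints price: $5
    }
}
```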
>
> *Multi-valued Fields*
> If a field has multiple values, the replacement will be applied to each value
> in turn.
> *Non-string Datatypes*
> Replacement is possible only on String field datatypes. If the field you
> name in the property is not a String datatype, it will be silently ignored.
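
The two rules above (apply to each value, skip non-String data) might look roughly like this. The sketch works on a plain `List<Object>` rather than on Nutch's own field classes, and checking the type per value is an assumption; the names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class MultiValueSketch {

    /** Applies the replacement to every String value; other values pass through untouched. */
    static List<Object> replaceValues(List<Object> values, Pattern p, String replacement) {
        List<Object> out = new ArrayList<>(values.size());
        for (Object v : values) {
            if (v instanceof String) {
                out.add(p.matcher((String) v).replaceAll(replacement));
            } else {
                out.add(v); // silently ignored: not a String
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Object> values = Arrays.<Object>asList("a  b", 42, "c   d");
        // Prints: [a b, 42, c d]
        System.out.println(replaceValues(values, Pattern.compile("\\s+"), " "));
    }
}
```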
> *Host and URL specific replacements*
> If the replacements should apply only to specific pages, then add a sequence
> like
> {code}
> hostmatch=hostmatchpattern
> fld1=/regexp/replace/flags
> fld2=/regexp/replace/flags
> {code}
> or
> {code}
> urlmatch=urlmatchpattern
> fld1=/regexp/replace/flags
> fld2=/regexp/replace/flags
> {code}
> When using Host and URL replacements, all replacements preceding the first
> hostmatch or urlmatch will apply to all Nutch documents. Replacements
> following a hostmatch or urlmatch will be applied to Nutch documents that
> match the host or url field (up to the next hostmatch or urlmatch line).
> hostmatch and urlmatch patterns must be unique in this property.
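
The scoping rule in the last paragraph can be sketched as a grouping step over the property value. This is only an illustration under stated assumptions: `MatchSectionSketch`, `group`, and the `GLOBAL` key are hypothetical, and rejecting a repeated hostmatch or urlmatch line with an exception is one possible reading of "must be unique".

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MatchSectionSketch {

    /** Hypothetical key for rules that precede the first hostmatch or urlmatch line. */
    static final String GLOBAL = "*global*";

    /** Groups rule lines under the hostmatch/urlmatch line that precedes them. */
    static Map<String, List<String>> group(String value) {
        Map<String, List<String>> sections = new LinkedHashMap<>();
        String current = GLOBAL;
        sections.put(current, new ArrayList<>());
        for (String raw : value.split("\\r?\\n")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith("hostmatch=") || line.startsWith("urlmatch=")) {
                if (sections.containsKey(line)) {
                    throw new IllegalArgumentException("Duplicate match line: " + line);
                }
                current = line;
                sections.put(current, new ArrayList<>());
            } else {
                sections.get(current).add(line);
            }
        }
        return sections;
    }

    public static void main(String[] args) {
        String value = "fld1=/a/b/\nhostmatch=.*\\.example\\.org\nfld2=/c/d/\n";
        // Prints: {*global*=[fld1=/a/b/], hostmatch=.*\.example\.org=[fld2=/c/d/]}
        System.out.println(group(value));
    }
}
```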