Jack Krupansky created SOLR-4864:
------------------------------------

             Summary: RegexReplaceProcessorFactory should support pattern capture group substitution in replacement string
                 Key: SOLR-4864
                 URL: https://issues.apache.org/jira/browse/SOLR-4864
             Project: Solr
          Issue Type: Improvement
          Components: update
    Affects Versions: 4.3
            Reporter: Jack Krupansky


It is unfortunate that the replacement string for RegexReplaceProcessorFactory 
is a pure, "quoted" (escaped) literal and does not support pattern capture 
group substitution. This processor should be enhanced to support full, standard 
pattern capture group substitution.

The test case I used:

{code}
  <updateRequestProcessorChain name="regex-mark-special-words">
    <processor class="solr.RegexReplaceProcessorFactory">
      <str name="fieldRegex">.*</str>
      <str name="pattern">([^a-zA-Z]|^)(cat|dog|fox)([^a-zA-Z]|$)</str>
      <str name="replacement">$1&lt;&lt;$2&gt;&gt;$3</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>
{code}

Indexing with this command against the standard Solr example with the above 
addition to solrconfig:

{code}
  curl "http://localhost:8983/solr/update?commit=true&update.chain=regex-mark-special-words" \
  -H 'Content-type:application/json' -d '
  [{"id": "doc-1",
    "title": "Hello World",
    "content": "The cat and the dog jumped over the fox.",
    "other_ss": ["cat","cat bird", "lazy dog", "red fox den"]}]'
{code}

Alas, the resulting document consists of:

{code}
  "id":"doc-1",
  "title":["Hello World"],
  "content":["The$1<<$2>>$3and the$1<<$2>>$3jumped over the$1<<$2>>$3"],
  "other_ss":["$1<<$2>>$3",
    "$1<<$2>>$3bird",
    "lazy$1<<$2>>$3",
    "red$1<<$2>>$3den"],
{code}

The Javadoc for RegexReplaceProcessorFactory uses the exact same terminology of 
"replacement string", as does Java's Matcher.replaceAll, but clearly the 
semantics are distinct, with replaceAll supporting pattern capture group 
substitution for its "replacement string", while RegexReplaceProcessorFactory 
interprets "replacement string" as being a literal. At a minimum, the 
RegexReplaceProcessorFactory Javadoc should explicitly state that the string is 
a literal that does not support pattern capture group substitution.

The relevant code in RegexReplaceProcessorFactory#init:

{code}
replacement = Matcher.quoteReplacement(replacementParam.toString());
{code}
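
To make the difference concrete, here is a minimal standalone Java sketch. It is not the Solr code itself; the class and method names (ReplacementDemo, quoted, unquoted) are made up for illustration. It applies the pattern from the test case with and without Matcher.quoteReplacement:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: class and method names are hypothetical, not part of Solr.
public class ReplacementDemo {
    // Same pattern as in the test chain above.
    static final Pattern PATTERN =
        Pattern.compile("([^a-zA-Z]|^)(cat|dog|fox)([^a-zA-Z]|$)");

    // Quoted replacement: "$1" etc. are escaped and emitted as literal text.
    static String quoted(String input, String replacement) {
        return PATTERN.matcher(input)
                      .replaceAll(Matcher.quoteReplacement(replacement));
    }

    // Unquoted replacement: "$1", "$2", "$3" refer to the capture groups.
    static String unquoted(String input, String replacement) {
        return PATTERN.matcher(input).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String input = "The cat and the dog jumped over the fox.";
        String replacement = "$1<<$2>>$3";
        System.out.println(quoted(input, replacement));
        System.out.println(unquoted(input, replacement));
    }
}
```

The quoted call reproduces the "content" value shown in the resulting document, while the unquoted call yields "The <<cat>> and the <<dog>> jumped over the <<fox>>.", which is the behavior this issue asks for.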

Possible options for the enhancement:

1. Simply skip the quoteReplacement and fully support pattern capture group 
substitution with no additional changes. This does have a minor backcompat 
issue: an existing replacement containing a literal "$" or "\" would be 
interpreted differently.

2. Add an alternative to "replacement", say "nonQuotedReplacement" that is not 
quoted as "replacement" is.

3. Add an option, say "quotedReplacement" that defaults to "true" for 
backcompat, but can be set to "false" to support full replaceAll pattern 
capture group substitution.
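
As a rough sketch of how option 3 might look, assuming a hypothetical helper named resolveReplacement and a nullable Boolean read from a "quotedReplacement" init param (neither exists in Solr today):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: this class and helper are hypothetical, not Solr API.
public class ReplacementOption {
    // Returns the string to hand to Matcher.replaceAll.
    // A null flag means the option was not configured, so default to quoting (backcompat).
    static String resolveReplacement(String replacementParam, Boolean quotedReplacement) {
        boolean quote = (quotedReplacement == null) || quotedReplacement;
        return quote ? Matcher.quoteReplacement(replacementParam) : replacementParam;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("(cat|dog|fox)");
        // Default: replacement is treated as a literal.
        System.out.println(p.matcher("lazy dog").replaceAll(resolveReplacement("<<$1>>", null)));
        // quotedReplacement=false: capture groups are substituted.
        System.out.println(p.matcher("lazy dog").replaceAll(resolveReplacement("<<$1>>", false)));
    }
}
```

Existing configurations that omit the flag would keep today's literal behavior, and only chains that explicitly set it to "false" would get capture group substitution.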

