[jira] [Commented] (SOLR-13242) RegexReplaceProcessorFactory not making accurate replacement

Gus Heck (JIRA) Mon, 12 Aug 2019 06:32:43 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-13242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16905193#comment-16905193
 ]


Gus Heck commented on SOLR-13242:
---------------------------------

RegexReplaceProcessorFactory is a very simple class. As noted above, it's 
just a wrapper around Matcher.replaceAll():
{code:java}
    return valueMutator(getSelector(), next, src -> {
      if (src instanceof CharSequence) {
        CharSequence txt = (CharSequence) src;
        return pattern.matcher(txt).replaceAll(replacement);
      }
      return src;
    });
{code}
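To make that concrete, here is a minimal standalone sketch of the same logic outside Solr. The class and method names are hypothetical, the input is Example 1 from the issue description, and the use of Matcher.quoteReplacement() is my assumption about what literalReplacement=true amounts to, so treat it as an illustration rather than Solr's actual code path.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReplaceAllSketch {
    // Hypothetical stand-in for the lambda above: character data is rewritten,
    // any other value type is passed through unchanged.
    static Object mutate(Pattern pattern, String replacement, Object src) {
        if (src instanceof CharSequence) {
            return pattern.matcher((CharSequence) src).replaceAll(replacement);
        }
        return src;
    }

    // Runs the last pattern from the issue, (\n\s*){2,}, against Example 1.
    static String example1() {
        Pattern pattern = Pattern.compile("(\\n\\s*){2,}");
        // quoteReplacement() makes '$' and '\' in the replacement literal
        // (assumed here to correspond to literalReplacement=true).
        String replacement = Matcher.quoteReplacement("<br><br>");
        return (String) mutate(pattern, replacement,
                "Dear Sir,  \n\n \n \n\n I am terminating");
    }

    public static void main(String[] args) {
        // prints: Dear Sir,  <br><br>I am terminating
        System.out.println(example1());
    }
}
```

With plain ASCII spaces and newlines the whole run collapses into a single replacement, which matches the "Index content" reported for Example 1.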
I notice this issue deals with patterns involving newlines and whitespace. 
Whitespace can be hairy and complicated, since it is not rendered visibly 
unless your editor fails to recognize it and prints ?'s... so the better and 
more complete your editor, the less likely you are to see the problem :). You 
often think you know what you're looking at but don't. It seems possible that 
you are running into unusual whitespace characters that break your pattern. 
You can see how fiddly this gets from the Javadoc in Pattern.java:

 
||Predefined character classes||
|{{.}}|Any character (may or may not match [line terminators|https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#lt])|
|{{\d}}|A digit: {{[0-9]}}|
|{{\D}}|A non-digit: {{[^0-9]}}|
|{{\h}}|A horizontal whitespace character: {{[ \t\xA0\u1680\u180e\u2000-\u200a\u202f\u205f\u3000]}}|
|{{\H}}|A non-horizontal whitespace character: {{[^\h]}}|
|{{\s}}|A whitespace character: {{[ \t\n\x0B\f\r]}}|
|{{\S}}|A non-whitespace character: {{[^\s]}}|
|{{\v}}|A vertical whitespace character: {{[\n\x0B\f\r\x85\u2028\u2029]}}|
|{{\V}}|A non-vertical whitespace character: {{[^\v]}}|
|{{\w}}|A word character: {{[a-zA-Z_0-9]}}|
|{{\W}}|A non-word character: {{[^\w]}}|

(\h|\v) appears to catch many more kinds of whitespace than \s, for example. 
See also the following entry (with (?U), which turns on the Unicode character 
class feature):
|{{\p\{Blank}}}|A space or a tab: {{[\p\{IsWhite_Space}&&[^\p\{gc=Zl}\p\{gc=Zp}\x0a\x0b\x0c\x0d\x85]]}}|
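
A quick way to see the difference between these classes is to test a few individual code points against each one. The sketch below is only an illustration (the class name and the sample code points are my own choices); it prints, for each sample, whether \s, \h, \v, and \s under (?U) match it.

```java
import java.util.regex.Pattern;

final class WhitespaceClasses {
    // True if the single code point, taken as a one-character string,
    // is matched in full by the given regex.
    static boolean matches(String regex, int codePoint) {
        return Pattern.matches(regex, new String(Character.toChars(codePoint)));
    }

    public static void main(String[] args) {
        // Space, tab, newline, no-break space, em space, ideographic space,
        // next line (NEL), line separator.
        int[] samples = {0x20, 0x09, 0x0A, 0xA0, 0x2003, 0x3000, 0x85, 0x2028};
        String[] classes = {"\\s", "\\h", "\\v", "(?U)\\s"};
        for (int cp : samples) {
            StringBuilder row = new StringBuilder(String.format("U+%04X", cp));
            for (String cls : classes) {
                row.append("  ").append(cls).append('=').append(matches(cls, cp));
            }
            System.out.println(row);
        }
    }
}
```

The row for U+00A0 (no-break space) is the interesting one: plain \s does not match it, while \h and (?U)\s both do.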

The comments above indicating that \W and \h work better than \s for the 
OP seem to support the hypothesis that the issue was caused by oddball 
whitespace in the data, not a bug in Solr. [~edwinyeozl] can we resolve this 
as "not a problem"?
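
For anyone who wants to check their own data, here is a rough sketch of how the hypothesis could be tested. Everything in it is an assumption on my part: the class and method names are made up, and the sample string is not the OP's real data but a guess at Example 2 with two no-break spaces (U+00A0) placed between the blank lines. describe() escapes anything that is not printable ASCII so the hidden characters become visible, and collapse() applies a pattern the same way the processor does.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WhitespaceDump {
    // Escapes every code point outside printable ASCII as \\uXXXX-style text
    // so invisible characters show up (code points above U+FFFF get more digits).
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 0x20 || cp > 0x7E) {
                sb.append(String.format("\\u%04X", cp));
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }

    // Replaces every match of regex with a literal <br><br>.
    static String collapse(String regex, String text) {
        return Pattern.compile(regex).matcher(text)
                .replaceAll(Matcher.quoteReplacement("<br><br>"));
    }

    public static void main(String[] args) {
        // Hypothetical input: no-break spaces sit between the two blank-line runs.
        String text = "Psalm 89:17   \n\n\u00A0\u00A0\n\n  3 Choa Chu Kang Avenue 4";
        System.out.println(describe(text));
        // Plain \s stops at the no-break spaces, so two separate matches are replaced.
        System.out.println(describe(collapse("(\\n\\s*){2,}", text)));
        // With (?U), \s also covers U+00A0 and the whole run is a single match.
        System.out.println(describe(collapse("(?U)(\\n\\s*){2,}", text)));
    }
}
```

Under that assumption the first pattern yields two {{<br><br>}} groups separated by the no-break spaces, which looks just like the four {{<br>}} in a row reported in Examples 2 and 3, while the (?U) variant yields one group.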

> RegexReplaceProcessorFactory not making accurate replacement
> ------------------------------------------------------------
>
>                 Key: SOLR-13242
>                 URL: https://issues.apache.org/jira/browse/SOLR-13242
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 7.6, 7.7, 7.7.1
>            Reporter: Edwin Yeo Zheng Lin
>            Priority: Major
>              Labels: regex, solr
>
> We are using the RegexReplaceProcessorFactory, and have tried with all of the 
> following configurations in solrconfig.xml:
>  
> <processor class="solr.RegexReplaceProcessorFactory">
>     <str name="fieldName">content</str>
>     <str name="pattern">(\s*\r?\n)\{2,}</str>
>     <str name="replacement"><br><br></str>
>     <bool name="literalReplacement">true</bool>
>   </processor>
> <processor class="solr.RegexReplaceProcessorFactory">
>     <str name="fieldName">content</str>
>     <str name="pattern">([ \s]*\r?\n)\{2,}</str>
>     <str name="replacement"><br><br></str>
>     <bool name="literalReplacement">true</bool>
>   </processor>
>  <processor class="solr.RegexReplaceProcessorFactory">
>     <str name="fieldName">content</str>
>     <str name="pattern">(\s*\n)\{2,}</str>
>     <str name="replacement"><br><br></str>
>     <bool name="literalReplacement">true</bool>
>   </processor>
>  <processor class="solr.RegexReplaceProcessorFactory">
>     <str name="fieldName">content</str>
>     <str name="pattern">(\n\s*)\{2,}</str>
>     <str name="replacement"><br><br></str>
>     <bool name="literalReplacement">true</bool>
>   </processor>
>  
> The regex pattern of (\s*\r?\n)\{2,}, ([ \s]*\r?\n)\{2,}, (\s*\n)\{2,} and 
> (\n\s*)\{2,} are working perfectly in [regex101.com|http://regex101.com/], in 
> which all the \n will be replaced by only two <br>
> However, in Solr, there are cases (in Example 2 and 3 below) that has four 
> <br> in a row. This should not be the case, as we have already set it to 
> replace by two <br> regardless of how many \n are there in a row.
>  
>  
> *Example 1: The sentence that the above regex pattern is working correctly* 
> *Original content in EML file:*  
> Dear Sir, 
>  
> I am terminating 
> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> *Index content:*     Dear Sir,  <br><br>I am terminating 
>  
> *Example 2: The sentence that the above regex pattern is partially working 
> (as you can see, instead of 2 <br>, there are 4 <br>)*
> *Original content in EML file:*    
> _exalted_
> _Psalm 89:17_
>  
> 3 Choa Chu Kang Avenue 4    
> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa Chu 
> Kang Avenue 4, Singapore
> *Index content:* exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3 Choa Chu 
> Kang Avenue 4, Singapore
>  
> *Example 3: The sentence that the above regex pattern is partially working 
> (as you can see, instead of 2 <br>, there are 4 <br>)*
> *Original content in EML file:*    
> [http://www.concordpri.moe.edu.sg/]
>  
>  
>  
>  
> On Tue, Dec 18, 2018 at 10:07 AM    
> *Original content:* [http://www.concordpri.moe.edu.sg/]   \n\n   \n\n \n \n\n 
> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018 
> at 10:07 AM 
> *Index content:* [http://www.concordpri.moe.edu.sg/]   <br><br>  <br><br>On 
> Tue, Dec 18, 2018 at 10:07 AM



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

