https://bugs.exim.org/show_bug.cgi?id=2472

            Bug ID: 2472
           Summary: Feature Request: PCRE2_SUBSTITUTE_LITERAL option for
                    pcre2_substitute without processing replacement
                    strings
           Product: PCRE
           Version: N/A
          Hardware: x86
                OS: Windows
            Status: NEW
          Severity: wishlist
          Priority: medium
         Component: Code
          Assignee: p...@hermes.cam.ac.uk
          Reporter: ew3...@gmail.com
                CC: pcre-dev@exim.org

Hi,

I am following the guidelines on https://pcre.org/ to file a feature request by
opening a bug ticket. 
I also tried searching for literal and pcre2_substitute in the closed and open
bug section but was not able to find a similar feature request.

Description:
------------
I think an additional option e.g. PCRE2_SUBSTITUTE_LITERAL 
which specifies that the replacement string in pcre2_substitute 
should not be processed at all would be useful for many programs
that utilize pcre2_substitute.

Rationale
---------
I believe a common use case is when arbitrary replacement strings 
are obtained from an external source and copying replacement strings 
for preprocessing/escaping is to be avoided.
One example that should be quite common is a long string 
containing monetary values, such as 

            "....amounts to $10 in value...."

(here the $ in the replacement string is the currency symbol for a dollar
value, not the start of a group reference).
Currently this would have to be escaped as 

            "....amounts to $$10 in value...."

or with extended syntax

            "\Q....amounts to $10 in value....\E"

according to https://pcre.org/current/doc/html/pcre2api.html#substitutions.
My personal use case is obtaining the replacement strings inside a 
user defined function of a database application.
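
A minimal sketch of the $$ escaping workaround described above, to show the
extra copy it forces on the caller. The helper name escape_dollars is
hypothetical and not part of PCRE2; the sketch assumes that
PCRE2_SUBSTITUTE_EXTENDED is not set, so that $ is the only character that
needs escaping.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE2: returns a newly allocated,
   NUL-terminated copy of src (len bytes) with every '$' doubled.
   Assumes non-extended replacement syntax, where '$' is the only
   special character. The caller frees the result; *out_len receives
   the escaped length. Returns NULL if allocation fails. */
static char *escape_dollars(const char *src, size_t len, size_t *out_len)
{
    assert(src != NULL && out_len != NULL);

    /* First pass: count '$' characters to size the new buffer. */
    size_t extra = 0;
    for (size_t i = 0; i < len; i++)
        if (src[i] == '$') extra++;

    char *dst = malloc(len + extra + 1);
    if (dst == NULL) return NULL;

    /* Second pass: copy, emitting an extra '$' before each '$'. */
    size_t j = 0;
    for (size_t i = 0; i < len; i++) {
        if (src[i] == '$') dst[j++] = '$';
        dst[j++] = src[i];
    }
    dst[j] = '\0';
    *out_len = j;
    return dst;
}
```

The escaped buffer and its length would then be passed to pcre2_substitute in
place of the original replacement string. This copy and the two passes over
the string are what the proposed option would make unnecessary.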

Comparison to other PCRE2 options
---------------------------------
A similar option PCRE2_LITERAL is available for pcre2_compile despite regular
expressions not being efficient for its use case.
The proposed option would be the counterpart to PCRE2_SUBSTITUTE_EXTENDED.
While PCRE2_SUBSTITUTE_EXTENDED increases replacement string processing 
complexity, PCRE2_SUBSTITUTE_LITERAL would decrease it.


Disadvantages of Alternatives
-----------------------------
Escape Replacement String
  Replacement strings need to be copied to a new buffer and escaped.
  This requires extra memory and knowledge of which characters are to be 
  escaped ($).

Extended syntax e.g. \Q \E
  Extended syntax also requires a new copy, wrapping the string in \Q and \E,
  and escaping any \E that occurs in the replacement string.

Substitution callouts
  A placeholder replacement string could be handed to pcre2_substitute
  (e.g. empty string) and literal replacement handled by a callout.
  This is not only cumbersome but also makes
  PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
  hard to use because callouts are not called for overflows.

Implementing a separate routine based on pcre2_substitute
  Implementing a correct routine that behaves as pcre2_substitute does 
  is not trivial and some internal methods that pcre2_substitute uses are 
  not exported.
  (e.g. UTF checks or direct access to the callouts set in the match context 
  which would require a different parameter set in the separate 
  implementation to handle callouts). 
  - Actually, get_callout and get_substitute_callout functionality
  in the public headers seems like something that could also be useful, but
  it is not part of this feature request.
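
For comparison, a minimal sketch of the \Q...\E alternative listed above. The
helper name quote_extended is hypothetical and not part of PCRE2. The handling
of an embedded \E (close the quoted run, emit an escaped backslash followed by
E, reopen the quoted run) is an assumption based on a reading of the
PCRE2_SUBSTITUTE_EXTENDED documentation, and is untested against the library.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE2: wraps src (len bytes) in
   \Q...\E for use with PCRE2_SUBSTITUTE_EXTENDED. Each embedded "\E"
   is rewritten as "\E\\E\Q" (assumed escaping scheme, see lead-in).
   The caller frees the result; returns NULL if allocation fails. */
static char *quote_extended(const char *src, size_t len, size_t *out_len)
{
    assert(src != NULL && out_len != NULL);

    /* Count "\E" pairs, scanning left to right. */
    size_t k = 0;
    for (size_t i = 0; i + 1 < len; i++)
        if (src[i] == '\\' && src[i + 1] == 'E') { k++; i++; }

    /* Each 2-byte "\E" grows to 7 bytes; add 4 for the wrapper and 1 for NUL. */
    char *dst = malloc(len + 5 * k + 4 + 1);
    if (dst == NULL) return NULL;

    size_t j = 0;
    memcpy(dst + j, "\\Q", 2); j += 2;
    for (size_t i = 0; i < len; i++) {
        if (src[i] == '\\' && i + 1 < len && src[i + 1] == 'E') {
            memcpy(dst + j, "\\E\\\\E\\Q", 7); j += 7;
            i++;                      /* skip the 'E' just consumed */
        } else {
            dst[j++] = src[i];
        }
    }
    memcpy(dst + j, "\\E", 2); j += 2;
    dst[j] = '\0';
    *out_len = j;
    return dst;
}
```

Even in this simple form the caller needs a second buffer, a scan for \E, and
knowledge of the extended escape rules, which is the overhead the proposed
option is meant to avoid.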

Implementation Thoughts
-----------------------
I hope some thoughts on untested code are appropriate here. 
I could not find a guideline with respect to that and I saw some code in other
reports. 
My first impression is that since the size of the replacement is known and it
is constant, one could call the CHECKMEMCPY macro in pcre2_substitute.c before
replacement processing and skip entering the replacement string processing loop
with the next relevant section being the callout section.
E.g. 

     BOOL all_literal = ((options & PCRE2_SUBSTITUTE_LITERAL)!=0);
...
     if (all_literal) {
        CHECKMEMCPY(replacement,rlength); 
        // skip replacement processing loop ...
     } else {
        // replacement processing loop ...
     }
     //callout section

The cost of such an implementation would then be one option bit among the
match options and one additional if check within the global loop of
pcre2_substitute.
