Re: [lang] Suggested alternatives for escape functions

Peter Wall Tue, 10 Dec 2013 16:31:49 -0800

Hi Bernd,

Thank you for taking the time to look at my submission. Let me see if I can answer your comments:

1. I have a separate version (which I did not include in my original email; I thought it was already long enough) which handles UTF-16 strings, that is, strings which could include Unicode surrogate sequences:

    public static final String escapeUTF16(String s, CharMapper mapper) {
        char ch1 = '\0', ch2 = '\0'; // avoid "possibly uninitialised" errors

        for (int i = 0, n = s.length(); i < n; ) {
            int k = i;
            ch1 = s.charAt(i++);
            String mapped;
            if (Character.isHighSurrogate(ch1)) {
                if (i >= n || !Character.isLowSurrogate(ch2 = s.charAt(i++)))
                    throw new IllegalArgumentException("Illegal surrogate sequence");
                mapped = mapper.map(Character.toCodePoint(ch1, ch2));
            }
            else
                mapped = mapper.map(ch1);
            if (mapped != null) {
                StringBuilder sb = new StringBuilder();
                for (int j = 0; j < k; ++j)
                    sb.append(s.charAt(j));
                sb.append(mapped);
                while (i < n) {
                    ch1 = s.charAt(i++);
                    if (Character.isHighSurrogate(ch1)) {
                        if (i >= n || !Character.isLowSurrogate(ch2 = s.charAt(i++)))
                            throw new IllegalArgumentException("Illegal surrogate sequence");
                        mapped = mapper.map(Character.toCodePoint(ch1, ch2));
                    }
                    else
                        mapped = mapper.map(ch1);
                    if (mapped != null)
                        sb.append(mapped);
                    else if (Character.isHighSurrogate(ch1))
                        sb.append(ch1).append(ch2);
                    else
                        sb.append(ch1);
                }
                return sb.toString();
            }
        }
        return s;
    }

As you can see, this uses the same CharMapper, and in this case it is called with a full Unicode code point. Whether to throw an exception or simply to process the characters anyway in the case of an erroneous surrogate sequence is a matter of debate; I have chosen the former in this case but I could be persuaded otherwise.
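The other side of that debate, processing the characters anyway, could be sketched roughly as follows. This is only an illustration under assumptions: the class name LenientEscape, the method escapeLenient and the DEMO mapper are hypothetical and not part of the submitted code. It relies on String.codePointAt, which returns an unpaired surrogate unchanged, so the mapper simply sees a lone code unit in that case.

```java
final class LenientEscape {

    /** Same shape as the CharMapper used by escapeUTF16. */
    public interface CharMapper {
        String map(int codePoint);
    }

    /**
     * Hypothetical lenient variant: never throws on malformed input.
     * An unpaired surrogate is passed to the mapper as a single code unit
     * (a value in the range 0xD800..0xDFFF).
     */
    public static String escapeLenient(String s, CharMapper mapper) {
        StringBuilder sb = null; // allocated only once a mapping is found
        for (int i = 0, n = s.length(); i < n; ) {
            int cp = s.codePointAt(i);
            int len = Character.charCount(cp); // 1 or 2 code units
            String mapped = mapper.map(cp);
            if (mapped != null) {
                if (sb == null)
                    sb = new StringBuilder(n + 16).append(s, 0, i);
                sb.append(mapped);
            }
            else if (sb != null)
                sb.append(s, i, i + len);
            i += len;
        }
        return sb == null ? s : sb.toString();
    }

    /** Demo mapper (an assumption): numeric references for 0x7F and above. */
    public static final CharMapper DEMO =
            cp -> cp >= 0x7F ? "&#" + cp + ";" : null;
}
```

With this sketch a well-formed pair such as "\uD83D\uDE00" maps to "&#128512;", while a lone "\uD83D" maps to "&#55357;" instead of raising an exception. Whether emitting a reference to a surrogate value is acceptable depends on the target format, which is exactly the trade-off described above.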

2. In different iterations of this code I have attempted to estimate the output length and pre-allocate the StringBuilder, but estimates are difficult. My most recent attempt used double the input string length, but for a 2-character string, where both characters convert to 8-character sequences, this would be worse than the StringBuilder default (of 16). Perhaps double the input string length plus 20 would be a good estimate. I'm happy to take suggestions on this point.
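The arithmetic behind the 2-character example can be checked with a small sketch. The class and method names here (CapacityEstimate, doubled, doubledPlus20) are made up for illustration, and the 8-character replacement string is just an example value.

```java
final class CapacityEstimate {

    /** The earlier attempt: double the input length. */
    public static int doubled(int inputLength) {
        return inputLength * 2;
    }

    /** The estimate floated above: double the input length plus 20. */
    public static int doubledPlus20(int inputLength) {
        return inputLength * 2 + 20;
    }

    public static void main(String[] args) {
        String input = "\u00E9\u00E9"; // hypothetical 2-character input
        String mapped = "&#x00E9;";    // an 8-character replacement for each
        int needed = input.length() * mapped.length();

        System.out.println("needed       = " + needed);                         // 16
        System.out.println("doubled      = " + doubled(input.length()));        // 4
        System.out.println("default      = " + new StringBuilder().capacity()); // 16
        System.out.println("doubled + 20 = " + doubledPlus20(input.length()));  // 24
    }
}
```

An underestimate is not an error, since StringBuilder grows on demand; it only costs extra reallocation and copying, so the estimate is purely a tuning question.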

3. I have a separate version of escape (and escapeUTF16) which takes a CharSequence and returns a CharSequence as output (in line with my principle of returning the input object unmodified if it needs no conversion). The code is identical except that 'return sb.toString();' becomes 'return sb;'. I realise that calling toString() on a String would return 'this' so there would be no unnecessary object allocation if I were to take a CharSequence as input and return a String. Again, I am happy to take suggestions.
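For reference, a sketch of what the CharSequence variant of escape would look like, following the description above (the same loop, with the final return changed). The wrapper class name CharSequenceEscape is an assumption for the sake of a self-contained example.

```java
final class CharSequenceEscape {

    public interface CharMapper {
        String map(int codePoint);
    }

    /**
     * Takes any CharSequence and returns the same object when nothing
     * needs escaping, otherwise the StringBuilder holding the result.
     */
    public static CharSequence escape(CharSequence s, CharMapper mapper) {
        for (int i = 0, n = s.length(); i < n; ) {
            char ch = s.charAt(i++);
            String mapped = mapper.map(ch);
            if (mapped != null) {
                StringBuilder sb = new StringBuilder();
                for (int j = 0, k = i - 1; j < k; ++j)
                    sb.append(s.charAt(j));
                sb.append(mapped);
                while (i < n) {
                    ch = s.charAt(i++);
                    mapped = mapper.map(ch);
                    if (mapped != null)
                        sb.append(mapped);
                    else
                        sb.append(ch);
                }
                return sb; // the only change from the String version
            }
        }
        return s; // input returned unmodified
    }
}
```

A caller that needs a String can call toString() on the result; when the input was a String that needed no conversion, that call returns the original object.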


Regards,
Peter


On 2013-12-11 03:39, Bernd Eckenfels wrote:

Hello,

It depends on what you want to escape: a single Unicode character
could be 2 UTF-16 code units (a single char can only cover the BMP).
So having a String-typed needle can be helpful. But of course all the
usual things are single-code-unit characters (<>&"...). Having said
that, is there any reason why CharMapper takes an int and not a char?
That's misleading in this context if someone expects it to be a real
code point, which it is not (using charAt()).
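The difference between what charAt() returns and a real code point can be seen in a small sketch (the class name CodeUnitDemo is made up for illustration; the String methods used are standard JDK calls):

```java
final class CodeUnitDemo {
    public static void main(String[] args) {
        String s = "\uD83D\uDE00"; // U+1F600, a character outside the BMP

        System.out.println(s.length());                            // 2 (code units)
        System.out.println(s.codePointCount(0, s.length()));       // 1 (code point)
        System.out.println(Integer.toHexString(s.charAt(0)));      // d83d (high surrogate)
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 1f600
    }
}
```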

Besides that, the implementation copies single characters to the new
StringBuilder and creates multiple string buffers in a loop without
guessing the initial length. That does not look like an efficient
implementation of the problem to me. I'm not sure where I have seen
functions that handle that, maybe in one of the XML parsers.

BTW: maybe also the input should be a CharSequence not a String?

Greetings
Bernd

On 10.12.2013, 05:14, Peter Wall <[email protected]> wrote:

Hi, I'm new here, so please forgive me if I'm duplicating a previous discussion (I looked back through several months of archives for something related, before suffering a near-fatal attack of tl;dr).

I have a toolbox of functions that I have accumulated over the years and among them are "escape" functions for converting, for example, XML "&" to "&amp;" etc. When I showed these to a colleague he asked why I didn't use the Apache Commons utilities, so I benchmarked my functions against the Commons versions and found that mine were approximately 10 times faster. At which point the same colleague suggested submitting my versions to Apache, so here goes.

The code in org.apache.commons.lang3.text.translate is very elegant in the way it uses the same code and the same initialisation character arrays for both the escape and the unescape functions, but this elegance comes at a cost. The unescape will need to look up multi-character sequences, but the escape code will ALWAYS be looking up single characters, and this can be made much simpler than a string match. And in my view the function should never allocate a new object until it finds that it needs to do so - in many cases the string will not need to be modified at all so the original string should be returned.

The escape function is:

     public static final String escape(String s, CharMapper mapper) {
         for (int i = 0, n = s.length(); i < n; ) {
             char ch = s.charAt(i++);
             String mapped = mapper.map(ch);
             if (mapped != null) {
                 StringBuilder sb = new StringBuilder();
                 for (int j = 0, k = i - 1; j < k; ++j)
                     sb.append(s.charAt(j));
                 sb.append(mapped);
                 while (i < n) {
                     ch = s.charAt(i++);
                     mapped = mapper.map(ch);
                     if (mapped != null)
                         sb.append(mapped);
                     else
                         sb.append(ch);
                 }
                 return sb.toString();
             }
         }
         return s;
     }

Where CharMapper is:

     public interface CharMapper {
         String map(int codePoint);
     }

and the implementation for XML is:

     private static final CharMapper allCharMapper = new CharMapper() {
         @Override
         public String map(int codePoint) {
             if (codePoint == '<')
                 return "&lt;";
             if (codePoint == '>')
                 return "&gt;";
             if (codePoint == '&')
                 return "&amp;";
             if (codePoint == '"')
                 return "&quot;";
             if (codePoint == '\'')
                 return "&apos;";
             // isWhiteSpace checks for XML whitespace characters, \n \r etc.
             if (codePoint < ' ' && !isWhiteSpace(codePoint) || codePoint >= 0x7F) {
                 StringBuilder sb = new StringBuilder(10);
                 sb.append("&#");
                 sb.append(codePoint);
                 sb.append(';');
                 return sb.toString();
             }
             return null;
         }
     };

The whole thing can be wrapped in a simple function like:

     public static String escapeAll(String s) {
         return escape(s, allCharMapper);
     }
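A self-contained usage sketch of the pieces above. Several things here are assumptions made for illustration: the class name XmlEscapeDemo, the body of isWhiteSpace (shown accepting tab, LF and CR), and the condensed form of escape, which keeps the allocate-only-when-needed behaviour but is not the submitted code.

```java
final class XmlEscapeDemo {

    public interface CharMapper {
        String map(int codePoint);
    }

    /** Assumed helper: the XML whitespace control characters. */
    static boolean isWhiteSpace(int cp) {
        return cp == '\t' || cp == '\n' || cp == '\r';
    }

    /** Same rules as allCharMapper, written as a lambda. */
    static final CharMapper ALL = cp -> {
        switch (cp) {
            case '<':  return "&lt;";
            case '>':  return "&gt;";
            case '&':  return "&amp;";
            case '"':  return "&quot;";
            case '\'': return "&apos;";
        }
        if (cp < ' ' && !isWhiteSpace(cp) || cp >= 0x7F)
            return "&#" + cp + ";";
        return null;
    };

    /** Condensed restatement of escape: no allocation until a mapping is found. */
    public static String escape(String s, CharMapper mapper) {
        StringBuilder sb = null; // stays null while nothing needs mapping
        for (int i = 0, n = s.length(); i < n; i++) {
            char ch = s.charAt(i);
            String mapped = mapper.map(ch);
            if (mapped != null && sb == null)
                sb = new StringBuilder().append(s, 0, i); // copy the clean prefix
            if (sb != null) {
                if (mapped != null)
                    sb.append(mapped);
                else
                    sb.append(ch);
            }
        }
        return sb == null ? s : sb.toString();
    }

    public static String escapeAll(String s) {
        return escape(s, ALL);
    }

    public static void main(String[] args) {
        System.out.println(escapeAll("a < b & \"c\"")); // a &lt; b &amp; &quot;c&quot;
        System.out.println(escapeAll("caf\u00E9"));     // caf&#233;
        String plain = "plain";
        System.out.println(escapeAll(plain) == plain);  // true
    }
}
```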

I have versions for Java string escapes, XML, HTML (including the full range of entity names) and URI percent encoding, and I have versions that handle UTF-16 surrogate codes. They all perform approximately an order of magnitude better than the existing Apache Commons functions. They are currently under LGPL and I have JUnit tests for all of them.

One thing to note is that my versions convert all characters over 0x7F to numeric character references, thus sidestepping any concerns over UTF-8 or ISO-8859-1 character set encoding.

Is anyone interested?

Regards,
Peter Wall



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]


