Re: [lang] Suggested alternatives for escape functions

Bernd Eckenfels Tue, 10 Dec 2013 08:40:00 -0800

Hello,

it depends on what you want to escape: a single Unicode character can take two chars (a UTF-16 code unit can only cover the BMP; anything beyond it needs a surrogate pair). So having a String-typed needle can be helpful. But of course all the usual suspects are single-char characters (<>&"...). Having said that, is there any reason why CharMapper takes an int and not a char? That is misleading in this context if someone expects it to be a real code point, which it is not (the value comes from charAt()).
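
To make the char versus code point distinction concrete, here is a small sketch. The class name CodePointDemo and the sample string are only illustrative and not part of the posted code; the calls themselves are standard java.lang.String methods.

```java
// Sketch: how a supplementary character looks through charAt() versus codePointAt().
final class CodePointDemo {
    // U+1F600 lies outside the BMP, so Java stores it as a surrogate pair (two chars).
    static final String SAMPLE = "a\uD83D\uDE00b";

    // Number of UTF-16 code units (what length() and charAt() work with).
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(charCount(SAMPLE));       // code units
        System.out.println(codePointCount(SAMPLE));  // code points
        // charAt(1) yields only the high surrogate, not the character itself.
        System.out.println(Integer.toHexString(SAMPLE.charAt(1)));
        // codePointAt(1) combines the surrogate pair into the real code point.
        System.out.println(Integer.toHexString(SAMPLE.codePointAt(1)));
    }
}
```

So a mapper that is fed from charAt() sees the two surrogates separately, even though its parameter is an int named codePoint.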

Besides that, the implementation copies single characters to the new StringBuilder and produces multiple StringBuilders in a loop without guessing the initial length. That does not look like an efficient solution to the problem to me. I am not sure where I have seen functions which handle that, maybe in one of the XML parsers.
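
A rough sketch of what I mean, not a tested replacement: copy the unescaped runs in bulk and give the builder an initial size. The class name BulkCopyEscape, the replacement() helper and the "+ 16" slack are my own assumptions for illustration; only the five XML entities from the posted mapper are covered here.

```java
// Sketch: lazy allocation, a presized builder, and bulk copies of unescaped runs.
final class BulkCopyEscape {
    // Hypothetical helper: returns a constant replacement, or null if none is needed.
    static String replacement(char ch) {
        switch (ch) {
            case '<':  return "&lt;";
            case '>':  return "&gt;";
            case '&':  return "&amp;";
            case '"':  return "&quot;";
            case '\'': return "&apos;";
            default:   return null;
        }
    }

    static String escape(String s) {
        StringBuilder sb = null;   // allocated only when the first escape is found
        int start = 0;             // start of the pending run of unescaped chars
        final int n = s.length();
        for (int i = 0; i < n; i++) {
            String rep = replacement(s.charAt(i));
            if (rep == null) {
                continue;
            }
            if (sb == null) {
                // Guess: input length plus some slack for the longer replacements.
                sb = new StringBuilder(n + 16);
            }
            // Copy the whole unescaped run at once instead of char by char.
            sb.append(s, start, i).append(rep);
            start = i + 1;
        }
        if (sb == null) {
            return s;              // nothing to escape: return the original string
        }
        return sb.append(s, start, n).toString();
    }
}
```

This keeps the property that an unmodified input is returned as-is, and the constant replacements avoid creating a temporary builder per character.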


BTW: maybe the input should also be a CharSequence, not a String?
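
As a sketch of that idea, assuming a mapper that only produces numeric character references: the loop below accepts any CharSequence and walks it by real code points. CodePointEscape and its map() method are hypothetical names; Character.codePointAt(CharSequence, int), Character.charCount(int) and StringBuilder.appendCodePoint(int) are standard JDK methods.

```java
// Sketch: escaping over a CharSequence, iterating by code point instead of by char.
final class CodePointEscape {
    // Hypothetical mapper: numeric character reference for 0x7F and above, else null.
    static String map(int codePoint) {
        if (codePoint >= 0x7F) {
            return "&#" + codePoint + ";";
        }
        return null;
    }

    static String escape(CharSequence input) {
        final int n = input.length();
        StringBuilder sb = new StringBuilder(n + 16); // guessed initial capacity
        for (int i = 0; i < n; ) {
            // Combines a valid surrogate pair into one code point.
            int cp = Character.codePointAt(input, i);
            i += Character.charCount(cp);             // advance by 1 or 2 chars
            String rep = map(cp);
            if (rep != null) {
                sb.append(rep);
            } else {
                sb.appendCodePoint(cp);
            }
        }
        return sb.toString();
    }
}
```

With this shape a caller can pass a String, a StringBuilder or a CharBuffer without converting first, and the int handed to the mapper really is a code point.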

Greetings
Bernd

On 10.12.2013 at 05:14, Peter Wall <pw...@pwall.net> wrote:

Hi, I'm new here, so please forgive me if I'm duplicating a previous discussion (I looked back through several months of archives for something related, before suffering a near-fatal attack of tl;dr).
I have a toolbox of functions that I have accumulated over the years, and among them are "escape" functions for converting, for example, XML "&" to "&amp;" etc. When I showed these to a colleague he asked why I didn't use the Apache Commons utilities, so I benchmarked my functions against the Commons versions and found that mine were approximately 10 times faster. At which point the same colleague suggested submitting my versions to Apache, so here goes.
The code in org.apache.commons.lang3.text.translate is very elegant in the way it uses the same code and the same initialisation character arrays for both the escape and the unescape functions, but this elegance comes at a cost. The unescape will need to look up multi-character sequences, but the escape code will ALWAYS be looking up single characters, and this can be made much simpler than a string match. And in my view the function should never allocate a new object until it finds that it needs to do so: in many cases the string will not need to be modified at all, so the original string should be returned.
The escape function is:

     public static final String escape(String s, CharMapper mapper) {
         for (int i = 0, n = s.length(); i < n; ) {
             char ch = s.charAt(i++);
             String mapped = mapper.map(ch);
             if (mapped != null) {
                 StringBuilder sb = new StringBuilder();
                 for (int j = 0, k = i - 1; j < k; ++j)
                     sb.append(s.charAt(j));
                 sb.append(mapped);
                 while (i < n) {
                     ch = s.charAt(i++);
                     mapped = mapper.map(ch);
                     if (mapped != null)
                         sb.append(mapped);
                     else
                         sb.append(ch);
                 }
                 return sb.toString();
             }
         }
         return s;
     }

Where CharMapper is:

     public interface CharMapper {
         String map(int codePoint);
     }

and the implementation for XML is:

     private static final CharMapper allCharMapper = new CharMapper() {
         @Override
         public String map(int codePoint) {
             if (codePoint == '<')
                 return "&lt;";
             if (codePoint == '>')
                 return "&gt;";
             if (codePoint == '&')
                 return "&amp;";
             if (codePoint == '"')
                 return "&quot;";
             if (codePoint == '\'')
                 return "&apos;";
             // isWhiteSpace checks for XML whitespace characters, \n \r etc.
             if ((codePoint < ' ' && !isWhiteSpace(codePoint)) || codePoint >= 0x7F) {
                 StringBuilder sb = new StringBuilder(10);
                 sb.append("&#");
                 sb.append(codePoint);
                 sb.append(';');
                 return sb.toString();
             }
             return null;
         }
     };

The whole thing can be wrapped in a simple function like:

     public static String escapeAll(String s) {
         return escape(s, allCharMapper);
     }
I have versions for Java string escapes, XML, HTML (including the full range of entity names) and URI percent encoding, and I have versions that handle UTF-16 surrogate codes. They all perform approximately an order of magnitude better than the existing Apache Commons functions. They are currently under LGPL and I have JUnit tests for all of them.
One thing to note is that my versions convert all characters from 0x7F upward to numeric character references, thus sidestepping any concerns over UTF-8 or ISO-8859-1 character set encoding.
Is anyone interested?

Regards,
Peter Wall


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org



--
http://www.zusammenkunft.net
