Fwd: [lang] Fix for Apache2,Apache3 StringEscapeUtils/UnicodeEscaper (3 objects created per non-ascii char) & simpler unicode hex logic

Henri Yandell Fri, 08 Mar 2013 15:22:08 -0800

Noting Lawrence's response. I'll get his email's contents added to JIRA :)

---------- Forwarded message ----------
From: Lawrence Angrave <[email protected]>
Date: Fri, Mar 8, 2013 at 2:36 PM
Subject: Re: [lang] Fix for Apache2,Apache3
StringEscapeUtils/UnicodeEscaper (3 objects created per non-ascii
char) & simpler unicode hex logic
To: Henri Yandell <[email protected]>



On 3/8/13 12:51 PM, Henri Yandell wrote:
>
> Thanks for the report Lawrence.
>
> With regards to Sebb's comment, I'm happy to put in a JIRA, but wanted
> to resolve the attribution comment first:
>
> "> Legal:  I, Lawrence Angrave (email address, [email protected]
>>
>> <mailto:[email protected]>), am the author and copyright holder of the
>> new code provided in this email, and share and re-license this code under
>> the current Apache2 license."
>
> Our solution on attribution of contributions is to add them to the
> contributor list (inside the pom.xml in the source and published on
> this page http://commons.apache.org/proper/commons-lang/team-list.html).
>
> Can you confirm that's acceptable?

Yes! Thanks!




> Typically we don't put the mail address in for contributors to reduce
> spam hitting them.

Very sensible. Glad you found it helpful.
Best,
Lawrence.

> Thanks again,
>
> Hen
>
> On Tue, Mar 5, 2013 at 2:28 PM, Lawrence Angrave <[email protected]> wrote:
>>
>> Hi,
>>
>> Some comments that are relevant to  Apache3 UnicodeEscaper and Apache2's
>> StringEscapeUtils.java
>> Summary-
>>
>>   * I noticed the current Apache code creates three String objects each
>>     time it writes a unicode hexadecimal  value.
>>   * Apache3 can also create a char[] array per character translation
>>     (but I do include a fix for that)
>>   * This is a easy-to-fix performance bottleneck when writing many
>>     non-ascii characters.
>>   * The logic to test for unicode values of different magnitudes can
>>     also be simplified.
>>   * Benchmark and code fixes for Apache2 and Apache 3 are included. I do
>>     not have time to become an Apache maintainer. use or ignore at your
>>     choice.
>>   * I'm not interested in being a developer for Commons Lang  Use it or
>>     not  - that's a choice for Commons Lang developers.
>>
>> A simple fix more than doubles the string escape speed (40 ms v 100ms to
>> translate all unicode characters) for Apache3.
>> The older Apache2-style implementation can now translate all unicode
>> characters in 8ms.
>>
>> The existing Apache3/Apache2 write unicode hex values like this-
>>          if (codepoint > 0xfff) {
>>                  out.write("\\u" + hex(codepoint));
>>              } else if (codepoint > 0xff) {
>>                  out.write("\\u0" + hex(codepoint));
>>              } else if (codepoint > 0xf) {
>>                  out.write("\\u00" + hex(codepoint));
>>              } else {
>>                  out.write("\\u000" + hex(codepoint));
>>              }
>> The hex() function,
>> //hex(): return Integer.toHexString(codepoint).toUpperCase(Locale.ENGLISH);
>> also creates two string objects, so we have 3 objects per unicode hex value.
>>
>> FIX:
>>
>> The padding logic can be simplified and per-character object creation can be
>> eliminated by writing hex digits directly
>>              out.write("\\u");
>>              out.write(HEX_DIGIT[(codepoint >> 12) & 15]);
>>              out.write(HEX_DIGIT[(codepoint >> 8) & 15]);
>>              out.write(HEX_DIGIT[(codepoint >> 4) & 15]);
>>              out.write(HEX_DIGIT[(codepoint) & 15]);
>>
>>
>> where  HEX_DIGIT  is
>> public static final char[] HEX_DIGIT = "0123456789ABCDEF".toCharArray();
>> I believe this is safe for all Locales.
>>
>>
>> When benchmarked it was disconcerting that Apache3 is still five times
>> slower (40ms instead of 8ms) than my rewritten Apache2 version (included
>> below).
>> My guess is that there are other unnecessary per-character object creation
>> issues still lurking Here's one example -
>> CharSequenceTranslator.translate(CharSequence input, Writer out) :
>>         char[] c = *Character.toChars*(Character.codePointAt(input, pos))
>>
>> For better performance this should use toChars(int codePoint,  char[] dst,
>> int dstIndex) , which can re-use the dst char array
>>
>>
>>
>> The benchmark, my version of a  Apache2-style escapeJavaStyleString
>> implementation and the code fix for UnicodeEscaper.java  are included below.
>> If you do find this useful, some source-code attribution would be
>> appreciated.
>> Legal:  I, Lawrence Angrave (email address, [email protected]
>> <mailto:[email protected]>), am the author and copyright holder of the
>> new code provided in this email, and share and re-license this code under
>> the current Apache2 license.
>>
>> I hope this email does not go into a blackhole... Feel free to forward it to
>> the relevant maintainers.
>>
>> Regards,
>> Lawrence.
>>
>>      public static final char[] HEX_DIGIT = "0123456789ABCDEF".toCharArray();
>>      public static final char[] CONTROL_CHARS; // non-zero entries for
>> special case control characters
>>      static {
>>          CONTROL_CHARS = new char[32];
>>          CONTROL_CHARS['\b'] = 'b';
>>          CONTROL_CHARS['\n'] = 'n';
>>          CONTROL_CHARS['\t'] = 't';
>>          CONTROL_CHARS['\f'] = 'f';
>>          CONTROL_CHARS['\r'] = 'r';
>>      }
>>      public static void  escapeJavaStyleString(Writer out, String s, boolean
>> escapeSingleQuote) throws IOException {
>> // Apache2 makes the following checks, so we will too-
>>      if(out==null) throw new IllegalArgumentException("The Writer must not be
>> null");
>>          if(s == null) return;
>>
>>          final int len = s.length();
>>          for(int i =0; i < len;i++)
>>              escapeChar(out,s.charAt(i), escapeSingleQuote);
>>      }
>>      public static void escapeChar(Writer out, char c, boolean
>> escapeSingleQuote)
>>              throws IOException {
>>          // Most common case
>>          if (c >= 32 && c < 127) {
>>              if (c == '\\' || c == '"' || (c == '\'' && escapeSingleQuote))
>>                  out.write('\\');
>>              out.write(c);
>>              return;
>>          }
>>          out.write('\\');
>>          if (c < 32 && CONTROL_CHARS[c] != 0) {
>>              out.write(CONTROL_CHARS[c]);
>>              return;
>>          }
>>          // Fast 4 digit uppercase hexadecimal without object creation
>>          out.write('u');
>>          out.write(HEX_DIGIT[(c >> 12) & 15]);
>>          out.write(HEX_DIGIT[(c >> 8) & 15]);
>>          out.write(HEX_DIGIT[(c >> 4) & 15]);
>>          out.write(HEX_DIGIT[(c) & 15]);
>>      }
>>
>>
>>
>>
>> FYI The benchmark test just writes all possible unicode characters into a
>> null writer:
>>              Writer nullWriter = new Writer() {
>>              public void write(String s) {
>>              };
>>
>>              public void write(int c) {
>>              }
>>
>>              public void close() throws IOException {
>>              }
>>
>>              public void flush() throws IOException {
>>              }
>>
>>              public void write(char[] cbuf, int off, int len) throws
>> IOException {
>>              }
>>          };
>>          StringBuilder sb = new StringBuilder(0x10000);
>>          for (int i = 0; i <= 0xffff; i++)
>>              sb.append((char) i);
>>          String allChars = sb.toString();
>>
>>          long t1 = System.currentTimeMillis();
>>          StringEscaper.escapeJavaStyleString(nullWriter, allChars, true);
>>          long t2 = System.currentTimeMillis();
>>          System.out.println(t2 - t1);
>>
>>          long t3 = System.currentTimeMillis();
>>          CharSequenceTranslator translator = StringEscapeUtils.ESCAPE_JAVA;
>>          translator.translate(allChars, nullWriter);
>>          long t4 = System.currentTimeMillis();
>>          System.out.println(t4 - t3);
>>
>>
>>
>> The modification to Apache3 UnicodeEscaper :
>>
>>          if (codepoint > 0xffff) {
>>              // TODO: Figure out what to do. Output as two Unicodes?
>>              // Does this make this a Java-specific output class?
>>              out.write("\\u" + hex(codepoint));
>>          } else if (1 == 0) { //*OLD SLOW CODE* (can be removed)
>> *if (codepoint > 0xfff) {
>>                  out.write("\\u" + hex(codepoint));
>>              } else if (codepoint > 0xff) {
>>                  out.write("\\u0" + hex(codepoint));
>>              } else if (codepoint > 0xf) {
>>                  out.write("\\u00" + hex(codepoint));
>>              } else {
>>                  out.write("\\u000" + hex(codepoint));
>>              }*
>>          } else { // *NEW FAST CODE*
>> *            out.write("\\u");
>>              out.write(HEX_DIGIT[(codepoint >> 12) & 15]);
>>              out.write(HEX_DIGIT[(codepoint >> 8) & 15]);
>>              out.write(HEX_DIGIT[(codepoint >> 4) & 15]);
>>              out.write(HEX_DIGIT[(codepoint) & 15]);*
>>          }
>>
>> *and add    public static final char[] HEX_DIGIT =
>> "0123456789ABCDEF".toCharArray();**
>> *
>>
>> ps.
>> The home page for Commons lang
>> http://commons.apache.org/proper/commons-lang/)
>>   has the following 404 link on the left
>>      Contributing Patches
>> http://commons.apache.org/proper/patches.html
>> (I believe the 'proper' is unnecessary)
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Fwd: [lang] Fix for Apache2,Apache3 StringEscapeUtils/UnicodeEscaper (3 objects created per non-ascii char) & simpler unicode hex logic

Reply via email to