Noting Lawrence's response. I'll get his email's contents added to JIRA :) ---------- Forwarded message ---------- From: Lawrence Angrave <angr...@illinois.edu> Date: Fri, Mar 8, 2013 at 2:36 PM Subject: Re: [lang] Fix for Apache2,Apache3 StringEscapeUtils/UnicodeEscaper (3 objects created per non-ascii char) & simpler unicode hex logic To: Henri Yandell <flame...@gmail.com>
On 3/8/13 12:51 PM, Henri Yandell wrote: > > Thanks for the report Lawrence. > > With regards to Sebb's comment, I'm happy to put in a JIRA, but wanted > to resolve the attribution comment first: > > "> Legal: I, Lawrence Angrave (email address, angr...@illinois.edu >> >> <mailto:angr...@illinois.edu>), am the author and copyright holder of the >> new code provided in this email, and share and re-license this code under >> the current Apache2 license." > > Our solution on attribution of contributions is to add them to the > contributor list (inside the pom.xml in the source and published on > this page http://commons.apache.org/proper/commons-lang/team-list.html). > > Can you confirm that's acceptable? Yes! Thanks! > Typically we don't put the mail address in for contributors to reduce > spam hitting them. Very sensible. Glad you found it helpful. Best, Lawrence. > Thanks again, > > Hen > > On Tue, Mar 5, 2013 at 2:28 PM, Lawrence Angrave <angr...@illinois.edu> wrote: >> >> Hi, >> >> Some comments that are relevant to Apache3 UnicodeEscaper and Apache2's >> StringEscapeUtils.java >> Summary- >> >> * I noticed the current Apache code creates three String objects each >> time it writes a unicode hexadecimal value. >> * Apache3 can also create a char[] array per character translation >> (but I do include a fix for that) >> * This is a easy-to-fix performance bottleneck when writing many >> non-ascii characters. >> * The logic to test for unicode values of different magnitudes can >> also be simplified. >> * Benchmark and code fixes for Apache2 and Apache 3 are included. I do >> not have time to become an Apache maintainer. use or ignore at your >> choice. >> * I'm not interested in being a developer for Commons Lang Use it or >> not - that's a choice for Commons Lang developers. >> >> A simple fix more than doubles the string escape speed (40 ms v 100ms to >> translate all unicode characters) for Apache3. >> The older Apache2-style implementation can now translate all unicode >> characters in 8ms. >> >> The existing Apache3/Apache2 write unicode hex values like this- >> if (codepoint > 0xfff) { >> out.write("\\u" + hex(codepoint)); >> } else if (codepoint > 0xff) { >> out.write("\\u0" + hex(codepoint)); >> } else if (codepoint > 0xf) { >> out.write("\\u00" + hex(codepoint)); >> } else { >> out.write("\\u000" + hex(codepoint)); >> } >> The hex() function, >> //hex(): return Integer.toHexString(codepoint).toUpperCase(Locale.ENGLISH); >> also creates two string objects, so we have 3 objects per unicode hex value. >> >> FIX: >> >> The padding logic can be simplified and per-character object creation can be >> eliminated by writing hex digits directly >> out.write("\\u"); >> out.write(HEX_DIGIT[(codepoint >> 12) & 15]); >> out.write(HEX_DIGIT[(codepoint >> 8) & 15]); >> out.write(HEX_DIGIT[(codepoint >> 4) & 15]); >> out.write(HEX_DIGIT[(codepoint) & 15]); >> >> >> where HEX_DIGIT is >> public static final char[] HEX_DIGIT = "0123456789ABCDEF".toCharArray(); >> I believe this is safe for all Locales. >> >> >> When benchmarked it was disconcerting that Apache3 is still five times >> slower (40ms instead of 8ms) than my rewritten Apache2 version (included >> below). >> My guess is that there are other unnecessary per-character object creation >> issues still lurking Here's one example - >> CharSequenceTranslator.translate(CharSequence input, Writer out) : >> char[] c = *Character.toChars*(Character.codePointAt(input, pos)) >> >> For better performance this should use toChars(int codePoint, char[] dst, >> int dstIndex) , which can re-use the dst char array >> >> >> >> The benchmark, my version of a Apache2-style escapeJavaStyleString >> implementation and the code fix for UnicodeEscaper.java are included below. >> If you do find this useful, some source-code attribution would be >> appreciated. >> Legal: I, Lawrence Angrave (email address, angr...@illinois.edu >> <mailto:angr...@illinois.edu>), am the author and copyright holder of the >> new code provided in this email, and share and re-license this code under >> the current Apache2 license. >> >> I hope this email does not go into a blackhole... Feel free to forward it to >> the relevant maintainers. >> >> Regards, >> Lawrence. >> >> public static final char[] HEX_DIGIT = "0123456789ABCDEF".toCharArray(); >> public static final char[] CONTROL_CHARS; // non-zero entries for >> special case control characters >> static { >> CONTROL_CHARS = new char[32]; >> CONTROL_CHARS['\b'] = 'b'; >> CONTROL_CHARS['\n'] = 'n'; >> CONTROL_CHARS['\t'] = 't'; >> CONTROL_CHARS['\f'] = 'f'; >> CONTROL_CHARS['\r'] = 'r'; >> } >> public static void escapeJavaStyleString(Writer out, String s, boolean >> escapeSingleQuote) throws IOException { >> // Apache2 makes the following checks, so we will too- >> if(out==null) throw new IllegalArgumentException("The Writer must not be >> null"); >> if(s == null) return; >> >> final int len = s.length(); >> for(int i =0; i < len;i++) >> escapeChar(out,s.charAt(i), escapeSingleQuote); >> } >> public static void escapeChar(Writer out, char c, boolean >> escapeSingleQuote) >> throws IOException { >> // Most common case >> if (c >= 32 && c < 127) { >> if (c == '\\' || c == '"' || (c == '\'' && escapeSingleQuote)) >> out.write('\\'); >> out.write(c); >> return; >> } >> out.write('\\'); >> if (c < 32 && CONTROL_CHARS[c] != 0) { >> out.write(CONTROL_CHARS[c]); >> return; >> } >> // Fast 4 digit uppercase hexadecimal without object creation >> out.write('u'); >> out.write(HEX_DIGIT[(c >> 12) & 15]); >> out.write(HEX_DIGIT[(c >> 8) & 15]); >> out.write(HEX_DIGIT[(c >> 4) & 15]); >> out.write(HEX_DIGIT[(c) & 15]); >> } >> >> >> >> >> FYI The benchmark test just writes all possible unicode characters into a >> null writer: >> Writer nullWriter = new Writer() { >> public void write(String s) { >> }; >> >> public void write(int c) { >> } >> >> public void close() throws IOException { >> } >> >> public void flush() throws IOException { >> } >> >> public void write(char[] cbuf, int off, int len) throws >> IOException { >> } >> }; >> StringBuilder sb = new StringBuilder(0x10000); >> for (int i = 0; i <= 0xffff; i++) >> sb.append((char) i); >> String allChars = sb.toString(); >> >> long t1 = System.currentTimeMillis(); >> StringEscaper.escapeJavaStyleString(nullWriter, allChars, true); >> long t2 = System.currentTimeMillis(); >> System.out.println(t2 - t1); >> >> long t3 = System.currentTimeMillis(); >> CharSequenceTranslator translator = StringEscapeUtils.ESCAPE_JAVA; >> translator.translate(allChars, nullWriter); >> long t4 = System.currentTimeMillis(); >> System.out.println(t4 - t3); >> >> >> >> The modification to Apache3 UnicodeEscaper : >> >> if (codepoint > 0xffff) { >> // TODO: Figure out what to do. Output as two Unicodes? >> // Does this make this a Java-specific output class? >> out.write("\\u" + hex(codepoint)); >> } else if (1 == 0) { //*OLD SLOW CODE* (can be removed) >> *if (codepoint > 0xfff) { >> out.write("\\u" + hex(codepoint)); >> } else if (codepoint > 0xff) { >> out.write("\\u0" + hex(codepoint)); >> } else if (codepoint > 0xf) { >> out.write("\\u00" + hex(codepoint)); >> } else { >> out.write("\\u000" + hex(codepoint)); >> }* >> } else { // *NEW FAST CODE* >> * out.write("\\u"); >> out.write(HEX_DIGIT[(codepoint >> 12) & 15]); >> out.write(HEX_DIGIT[(codepoint >> 8) & 15]); >> out.write(HEX_DIGIT[(codepoint >> 4) & 15]); >> out.write(HEX_DIGIT[(codepoint) & 15]);* >> } >> >> *and add public static final char[] HEX_DIGIT = >> "0123456789ABCDEF".toCharArray();** >> * >> >> ps. >> The home page for Commons lang >> http://commons.apache.org/proper/commons-lang/) >> has the following 404 link on the left >> Contributing Patches >> http://commons.apache.org/proper/patches.html >> (I believe the 'proper' is unnecessary) >> --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org For additional commands, e-mail: dev-h...@commons.apache.org