[ https://issues.apache.org/jira/browse/LANG-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716850#action_12716850 ]
Henri Yandell commented on LANG-505:
------------------------------------
Worked on this some more. The notion of passing in CharSequences works well,
with a codepoint version to ease coding (and currently a char version, but it
may be pointless as codepoints feel like chars). I dropped the idea of a
two-step API in which a first call queries whether an escape will happen and a
second call performs it. It's not needed as yet.
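To make the "codepoint version" concrete, here is a minimal sketch of how a per-codepoint escaper could be adapted to the CharSequence-and-index shape. The class names CodePointEscaperSketch and LowAsciiSketch, and the escapeAt helper, are made up for illustration and are not the code in the patch; the return value follows the "number of codepoints consumed, 0 for none" contract used by the core loop.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical base class: adapts a per-codepoint method to the
// CharSequence-and-index shape described in the comment.
abstract class CodePointEscaperSketch {
    // Returns the number of codepoints consumed (0 or 1 for this adapter).
    public final int escape(CharSequence input, int index, Writer out) throws IOException {
        int codePoint = Character.codePointAt(input, index);
        return escape(codePoint, out) ? 1 : 0;
    }

    // Subclasses see one codepoint at a time and report whether they wrote output.
    protected abstract boolean escape(int codePoint, Writer out) throws IOException;
}

// Illustrative subclass: writes control characters below 0x20 as
// backslash-u escapes with four hex digits.
final class LowAsciiSketch extends CodePointEscaperSketch {
    @Override
    protected boolean escape(int codePoint, Writer out) throws IOException {
        if (codePoint >= 0x20) {
            return false; // not handled; the caller copies it through
        }
        out.write(String.format("\\u%04X", codePoint));
        return true;
    }

    // Convenience for trying it out: escape the codepoint at index into a String.
    static String escapeAt(CharSequence input, int index) {
        StringWriter out = new StringWriter();
        try {
            new LowAsciiSketch().escape(input, index, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

A subclass written this way never has to think about surrogate pairs, which is presumably what "ease coding" refers to.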
escapeJava looks like:
{code:java}
public static void escapeJava(String input, Writer out) throws IOException {
    AggregateEscaper escapers = new AggregateEscaper(
        new LookupTranslator(
            new String[][] {
                {"\"", "\\\""},
                {"\\", "\\\\"}
            }),
        new EscapeLowAsciiAsUnicode(),
        new EscapeNonAsciiAsUnicode()
    );
    escape(input, escapers, out);
}
{code}
i.e., despite the API change, it is much the same as before. I went ahead and
stopped the Escaper from extending FilterWriter, which is why the out variable
no longer gets repeated everywhere.
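For readers without the patch at hand, the following is a rough sketch of how a lookup-based escaper and an aggregate of escapers might behave under the same contract. EscaperSketch, LookupSketch and AggregateSketch are stand-in names invented here; the actual LookupTranslator and AggregateEscaper may differ, and the "first delegate that consumes anything wins" rule is an assumption.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical contract: return the number of codepoints consumed at
// index, or 0 if this escaper does not apply there.
interface EscaperSketch {
    int escape(CharSequence input, int index, Writer out) throws IOException;
}

// Sketch of a lookup-based escaper: tries each {from, to} pair in order.
// Assumes every "from" key is non-empty.
final class LookupSketch implements EscaperSketch {
    private final String[][] table;

    LookupSketch(String[][] table) {
        this.table = table;
    }

    public int escape(CharSequence input, int index, Writer out) throws IOException {
        for (String[] pair : table) {
            String from = pair[0];
            int end = index + from.length();
            if (end <= input.length() && from.contentEquals(input.subSequence(index, end))) {
                out.write(pair[1]);
                return Character.codePointCount(from, 0, from.length());
            }
        }
        return 0;
    }
}

// Sketch of an aggregate: the first delegate that consumes anything wins
// (an assumption, not confirmed by the comment).
final class AggregateSketch implements EscaperSketch {
    private final EscaperSketch[] delegates;

    AggregateSketch(EscaperSketch... delegates) {
        this.delegates = delegates;
    }

    public int escape(CharSequence input, int index, Writer out) throws IOException {
        for (EscaperSketch delegate : delegates) {
            int consumed = delegate.escape(input, index, out);
            if (consumed != 0) {
                return consumed;
            }
        }
        return 0;
    }

    // Convenience for trying it out: run one escaper at one index.
    static String escapeAt(EscaperSketch escaper, CharSequence input, int index) {
        StringWriter out = new StringWriter();
        try {
            escaper.escape(input, index, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

Because each escaper receives the Writer as an argument instead of wrapping one, the aggregate can hand the same Writer to every delegate without any FilterWriter chaining.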
The core algorithm itself now looks like:
{code:java}
public static void escape(CharSequence input, CharSequenceEscaper escaper,
                          Writer out) throws IOException {
    if (out == null) {
        throw new IllegalArgumentException("The Writer must not be null");
    }
    if (input == null) {
        return;
    }
    if (escaper == null) {
        throw new IllegalArgumentException("The CharSequenceEscaper must not be null. " +
                "Use NullEscaper if you expected this to mean a no-operation");
    }
    // i is a char index into input (not a codepoint count), because
    // Character.codePointAt expects a char index
    int len = input.length();
    int i = 0;
    while (i < len) {
        // consumed is the number of codepoints consumed
        int consumed = escaper.escape(input, i, out);
        if (consumed == 0) {
            // nothing was escaped: copy the codepoint through and step past it
            char[] chars = Character.toChars(Character.codePointAt(input, i));
            out.write(chars);
            i += chars.length;
        } else {
            // contract with escapers is that they have to understand codepoints;
            // advance past each codepoint consumed (2 chars for a surrogate pair)
            for (int j = 0; j < consumed; j++) {
                i += Character.charCount(Character.codePointAt(input, i));
            }
        }
    }
}
{code}
This should be able to implement unescape as well, since it is now a simple
general-purpose text translation API that can handle n->m changes in
characters; i.e., a plugin can choose to consume 3 characters and turn them
into 5, and vice versa.
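As an illustration of an n->m change in the unescape direction, here is a small sketch in which a translator consumes the six characters of a \uXXXX sequence and writes a single char. The names UnescapeSketch, Translator and UNICODE_UNESCAPER are hypothetical, and the driver loop is a compact stand-in included only so the block runs on its own.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

final class UnescapeSketch {
    // Hypothetical contract: return codepoints consumed at index, or 0.
    interface Translator {
        int translate(CharSequence input, int index, Writer out) throws IOException;
    }

    // Consumes a backslash, a 'u' and four hex digits (6 in), writes one char (1 out).
    static final Translator UNICODE_UNESCAPER = (input, index, out) -> {
        if (index + 6 > input.length()
                || input.charAt(index) != '\\' || input.charAt(index + 1) != 'u') {
            return 0;
        }
        int value = 0;
        for (int k = 2; k < 6; k++) {
            int digit = Character.digit(input.charAt(index + k), 16);
            if (digit < 0) {
                return 0; // not a valid escape; the driver copies it through
            }
            value = value * 16 + digit;
        }
        out.write(value); // Writer.write(int) writes a single char
        return 6;
    };

    // Stand-in driver: walks the input by char index and skips whatever
    // the translator reports as consumed.
    static String translate(CharSequence input, Translator translator) {
        StringWriter out = new StringWriter();
        try {
            int pos = 0;
            while (pos < input.length()) {
                int consumed = translator.translate(input, pos, out);
                if (consumed == 0) {
                    char[] chars = Character.toChars(Character.codePointAt(input, pos));
                    out.write(chars);
                    pos += chars.length;
                } else {
                    for (int j = 0; j < consumed; j++) {
                        pos += Character.charCount(Character.codePointAt(input, pos));
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

The driver does not need to know whether a translator grows or shrinks the text; it only trusts the consumed count, which is what lets the same loop serve both escape and unescape.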
> Rewrite StringEscapeUtils
> -------------------------
>
> Key: LANG-505
> URL: https://issues.apache.org/jira/browse/LANG-505
> Project: Commons Lang
> Issue Type: Task
> Reporter: Henri Yandell
> Fix For: 3.0
>
>
> I think StringEscapeUtils needs a strong rewrite. For each escape method (and
> unescape) there tend to be three or four types of escaping happening. So not
> being able to define which set of three or four apply is a pain point (and
> cause of bug reports due to different desired features).
> We should be offering basic functionality, but also allowing people to say
> "escape(Escapers.BASIC_XML, Escapers.LOW_UNICODE, Escapers.HIGH_UNICODE)".
> Also should delete escapeSql; it's a bad one imo. Dangerous in that it will
> lead people to not use PreparedStatement and given it only escapes ', it's
> not much use. Especially as different dialects escape that in different ways.
> Opening this ticket for discussion.