I'm going to upload a new patch momentarily, rebased from trunk, with
changes from the comments and a bunch of other minor changes.
I've got someone asking for the fix to the IE substr performance
problem, so I'd like to get this committed soon, and then work on the
other performance issues later, which are minor relative to the IE
substr problem.
On 2011/05/27 20:22:28, MikeSamuel wrote:
http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-sanitizer-exp.js
File src/com/google/caja/plugin/html-sanitizer-exp.js (right):
http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-sanitizer-exp.js#newcode1
src/com/google/caja/plugin/html-sanitizer-exp.js:1: // stub for
experimenting
with changes to html-sanitizer.js
This doesn't seem to be included in the html-sanitizer bundle in
build.xml.
intentionally excluded. added a clarifying comment.
http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-sanitizer-r4455.js
File src/com/google/caja/plugin/html-sanitizer-r4455.js (right):
http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-sanitizer-r4455.js#newcode1
src/com/google/caja/plugin/html-sanitizer-r4455.js:1: // Copyright (C)
2006
Google Inc.
Is this the old version? What are the long term plans for this?
originally I thought it would be helpful if users could refer to the old
version of html-sanitizer at a public url, but there doesn't seem to be
much value to that, so I've renamed the files -legacy, and now they're
only used for testing.
http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-sanitizer.js
File src/com/google/caja/plugin/html-sanitizer.js (right):
http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-sanitizer.js#newcode192
src/com/google/caja/plugin/html-sanitizer.js:192: var ATTR_RE = new
RegExp(
Ok, so this is meant to match a name = value pair?
yes, added a comment.
http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-sanitizer.js#newcode208
src/com/google/caja/plugin/html-sanitizer.js:208: '[^\"\'\\s]*' ) +
If this is not matched against any string containing the ">" or "/>"
closing a
tag, then please document that fact and ignore the below. Otherwise,
...
Yeah, there's never a closing > in the string matched. Clarified in the
comment.
http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-sanitizer.js#newcode252
src/com/google/caja/plugin/html-sanitizer.js:252: // parts if we
discover
they're in a different context.
Is splitting faster than global matching? e.g. str.match(/.../g)
I think I did an experiment with parsing using //g at one point, but I
don't remember the result of that. I'll look at it again when I work on
the minor performance issues marked as TODO.
http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-sanitizer.js#newcode266
src/com/google/caja/plugin/html-sanitizer.js:266: var nextEndComment =
0;
Maybe initialize to -1 if my comments on uses of nextGT and
nextEndComment are
correct.
I'm not actually using this the way I thought, so I replaced it with a
flag noMoreEndComments, meaning we've scanned to the end of the input
and didn't find any end comment markers, so we don't have to repeat the
scan if we see another open comment marker.
http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-sanitizer.js#newcode267
src/com/google/caja/plugin/html-sanitizer.js:267: while (pos < end) {
The relationship between pos,end,parts and this loop might be more
obvious if
they were declared in the loop
for (var pos = 0, end = parts.length; pos < end;) {
done
http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-sanitizer.js#newcode277
src/com/google/caja/plugin/html-sanitizer.js:277: }
Do your benchmarks take into account that there is a call to pcdata
per entity.
Some real HTML has large numbers of 's.
yes, some of the test cases are large number of & entities repeated, and
these are mostly insignificant performance differences between old and
new versions.
Is the fact that contiguous text segments might be split into separate
pcdata
calls a change from existing behavior?
yes, but I think it doesn't matter. I've added a comment about it, and
also updated the description of this change to explain my reasoning.
http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-sanitizer.js#newcode320
src/com/google/caja/plugin/html-sanitizer.js:320: p = (nextEndComment
< pos + 1)
? pos + 1 : nextEndComment;
Is < pos + 1 the right boundary condition? You've already incremented
pos at
the top of the loop, so does (pos) point at the token after the
"<!--"?
pos + 1 is correct, because '--' and '>' are separate tokens, and we're
looking for the '>'. added a clarifying comment.
http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-sanitizer.js#newcode374
src/com/google/caja/plugin/html-sanitizer.js:374: var re =
/(<\/|<!--|<[!?]|[&<>])/g;
Why is it necessary to split on ">"?
making '>' a separate token lets us quickly find the end of a tag in
cases where quotes aren't involved, which is almost all end tags and
many start tags. I'm not sure I tried a variant that doesn't split on
'>'; I might explore that option when working on some of the performance
TODOs.
Can you split using a forward lookahead to get consistent behavior on
IE?
var re = /(?=<(?:[/&!?]?|!--|>)/;
'axbxc'.split(/(?=x)/g) is ['a', 'xb', 'xc'], which is not the same, but
might be usable with a tweak of the algorithm. I'll explore this later.
http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-sanitizer.js#newcode418
src/com/google/caja/plugin/html-sanitizer.js:418: function
parseText(parts, tag,
h, param) {
I haven't looked clearly at this method yet. Can you add a comment on
what it's
supposed to do.
done
http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-sanitizer.js#newcode421
src/com/google/caja/plugin/html-sanitizer.js:421: endTagRe[tag.name] =
new
RegExp('^' + tag.name + '(?:[\\s\\/]|$)', 'i');
Ok. Even if tag.name can contain any of [\w:-] none of those are
special in
regexps outside charsets. Obviously . and $ would cause problems.
right
http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-sanitizer.js#newcode451
src/com/google/caja/plugin/html-sanitizer.js:451: // For now,
optimistically
assume there are no quoted '>'
What does this mean? Is this assumption revisited? Is this
assumption the
reason for the attr regexp that I commented upon above?
yes, clarified the comment.
http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-sanitizer.js#newcode500
src/com/google/caja/plugin/html-sanitizer.js:500: if (q === '"' || q
=== "'") {
charCodeAt can be faster than charAt.
q = v.charCodeAt(0);
if (q === 0x22 || q === 0x27) {
done
http://codereview.appspot.com/4559048/