[Caja] Re: faster html-sanitizer.js (issue 4559048)

felix8a Tue, 13 Mar 2012 10:01:19 -0700

I'm going to upload a new patch momentarily, rebased from trunk, with
changes from the comments and a bunch of other minor changes.


I've got someone asking for the fix to the IE substr performance
problem, so I'd like to get this committed soon, and then work on the
other performance issues later, which are minor relative to the IE
substr problem.

On 2011/05/27 20:22:28, MikeSamuel wrote:

http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-sanitizer-exp.js

File src/com/google/caja/plugin/html-sanitizer-exp.js (right):



http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-sanitizer-exp.js#newcode1

src/com/google/caja/plugin/html-sanitizer-exp.js:1: // stub for

experimenting

with changes to html-sanitizer.js
This doesn't seem to be included in the html-sanitizer bundle in

build.xml.

intentionally excluded.  added a clarifying comment.


http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-sanitizer-r4455.js

File src/com/google/caja/plugin/html-sanitizer-r4455.js (right):



http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-sanitizer-r4455.js#newcode1

src/com/google/caja/plugin/html-sanitizer-r4455.js:1: // Copyright (C)

Google Inc.
Is this the old version?  What are the long term plans for this?


originally I thought it would be helpful if users could refer to the old
version of html-sanitizer at a public url, but there doesn't seem to be
much value to that, so I've renamed the files -legacy, and now they're
only used for testing.


http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-sanitizer.js

File src/com/google/caja/plugin/html-sanitizer.js (right):



http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-sanitizer.js#newcode192

src/com/google/caja/plugin/html-sanitizer.js:192: var ATTR_RE = new

RegExp(

Ok, so this is meant to match a name = value pair?


yes, added a comment.


http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-sanitizer.js#newcode208

src/com/google/caja/plugin/html-sanitizer.js:208: '[^\"\'\\s]*' ) +
If this is not matched against any string containing the ">" or "/>"

closing a

tag, then please document that fact and ignore the below.  Otherwise,

...

Yeah, there's never a closing > in the string matched.  Clarified in the
comment.


http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-sanitizer.js#newcode252

src/com/google/caja/plugin/html-sanitizer.js:252: // parts if we

discover

they're in a different context.
Is splitting faster than global matching?  e.g. str.match(/.../g)


I think I did an experiment with parsing using //g at one point, but I
don't remember the result of that.  I'll look at it again when I work on
the minor performance issues marked as TODO.


http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-sanitizer.js#newcode266

src/com/google/caja/plugin/html-sanitizer.js:266: var nextEndComment =

0;

Maybe initialize to -1 if my comments on uses of nextGT and

nextEndComment are

correct.


I'm not actually using this the way I thought, so I replaced it with a
flag noMoreEndComments, meaning we've scanned to the end of the input
and didn't find any end comment markers, so we don't have to repeat the
scan if we see another open comment marker.


http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-sanitizer.js#newcode267

src/com/google/caja/plugin/html-sanitizer.js:267: while (pos < end) {
The relationship between pos,end,parts and this loop might be more

obvious if

they were declared in the loop

     for (var pos = 0, end = parts.length; pos < end;) {


done


http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-sanitizer.js#newcode277

src/com/google/caja/plugin/html-sanitizer.js:277: }
Do your benchmarks take into account that there is a call to pcdata

per entity.

Some real HTML has large numbers of &nbsp;'s.


yes, some of the test cases are large number of & entities repeated, and
these are mostly insignificant performance differences between old and
new versions.

Is the fact that contiguous text segments might be split into separate

pcdata

calls a change from existing behavior?


yes, but I think it doesn't matter.  I've added a comment about it, and
also updated the description of this change to explain my reasoning.


http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-sanitizer.js#newcode320

src/com/google/caja/plugin/html-sanitizer.js:320: p = (nextEndComment

< pos + 1)

? pos + 1 : nextEndComment;
Is < pos + 1 the right boundary condition?  You've already incremented

pos at

the top of the loop, so does (pos) point at the token after the

"<!--"?

pos + 1 is correct, because '--' and '>' are separate tokens, and we're
looking for the '>'.  added a clarifying comment.


http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-sanitizer.js#newcode374

src/com/google/caja/plugin/html-sanitizer.js:374: var re =
/(<\/|<!--|<[!?]|[&<>])/g;
Why is it necessary to split on ">"?


making '>' a separate token lets us quickly find the end of a tag in
cases where quotes aren't involved, which is almost all end tags and
many start tags.  I'm not sure I tried a variant that doesn't split on
'>'; I might explore that option when working on some of the performance
TODOs.

Can you split using a forward lookahead to get consistent behavior on

IE?

var re = /(?=<(?:[/&!?]?|!--|>)/;


'axbxc'.split(/(?=x)/g) is ['a', 'xb', 'xc'], which is not the same, but
might be usable with a tweak of the algorithm.  I'll explore this later.


http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-sanitizer.js#newcode418

src/com/google/caja/plugin/html-sanitizer.js:418: function

parseText(parts, tag,

h, param) {
I haven't looked clearly at this method yet.  Can you add a comment on

what it's

supposed to do.


done


http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-sanitizer.js#newcode421

src/com/google/caja/plugin/html-sanitizer.js:421: endTagRe[tag.name] =

new

RegExp('^' + tag.name + '(?:[\\s\\/]|$)', 'i');
Ok.  Even if tag.name can contain any of [\w:-] none of those are

special in

regexps outside charsets.  Obviously . and $ would cause problems.


right


http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-sanitizer.js#newcode451

src/com/google/caja/plugin/html-sanitizer.js:451: // For now,

optimistically

assume there are no quoted '>'
What does this mean?  Is this assumption revisited?  Is this

assumption the

reason for the attr regexp that I commented upon above?


yes, clarified the comment.


http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-sanitizer.js#newcode500

src/com/google/caja/plugin/html-sanitizer.js:500: if (q === '"' || q

=== "'") {

charCodeAt can be faster than charAt.
q = v.charCodeAt(0);
if (q === 0x22 || q === 0x27) {


done


http://codereview.appspot.com/4559048/

[Caja] Re: faster html-sanitizer.js (issue 4559048)

Reply via email to