This is nice work.  Comments inline.

http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/html-sanitizer-exp.js
File src/com/google/caja/plugin/html-sanitizer-exp.js (right):

http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/html-sanitizer-exp.js#newcode6
src/com/google/caja/plugin/html-sanitizer-exp.js:6: */
What do you mean by "this file defines"?  Are html objects defined in
other files, like html-sanitizer-legacy, observably different?

http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/html-sanitizer.js
File src/com/google/caja/plugin/html-sanitizer.js (right):

http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/html-sanitizer.js#newcode207
src/com/google/caja/plugin/html-sanitizer.js:207: // when we see that,
we combine additional tokens to balance the quote.
Ok.  So instead of doing a global match,


/<(script|textarea|xmp|style|title)\b(?:[^>"']+|"[^"]*"|'[^']*')*>(?:[^<]+| <(?!\/\1\b))*|<\/?\w+(?:[^>"']+|"[^"]*"|'[^']*')*>|<!--[\s\S]*?-->|[^<]+|</g

and then parsing for attributes inside tags, you're recombining.

http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/html-sanitizer.js#newcode217
src/com/google/caja/plugin/html-sanitizer.js:217: '(\')[^\']*(\'|$)' +
 // 6, 7 = Single-quoted string
Do we get any benefit from doing ([\"\'])[\s\S]*?(\4|$) and avoid having
two sets of quote groups?

http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/html-sanitizer.js#newcode221
src/com/google/caja/plugin/html-sanitizer.js:221:
'(?=[a-z][a-z-]*\\s*=)' +
Not necessarily relevant to this CL, but if it's easier to drop this
case, I think we can.  I think HTML5 actually settled on treating

   <a a= b=c>

as equivalent to

    <a a= "b=c">

http://www.w3.org/TR/html5/tokenization.html#before-attribute-value-state
and
http://www.w3.org/TR/html5/tokenization.html#attribute-value-unquoted-state
indicate that '=' inside an unquoted value is an error state, but the
procedure to follow when you don't fail fast is to treat the '=' and
following content as attribute value content.

http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/html-sanitizer.js#newcode230
src/com/google/caja/plugin/html-sanitizer.js:230: var ENTITY_RE =
/^(#[0-9]+|#[x][0-9a-f]+|\w+);/i;
[x] -> x

http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/html-sanitizer.js#newcode233
src/com/google/caja/plugin/html-sanitizer.js:233: var splitWillCapture =
('a,b'.split(/(,)/).length === 3);
Ok.  Does it matter whether ','.split(/,/).length == 2 or 0?

http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/html-sanitizer.js#newcode316
src/com/google/caja/plugin/html-sanitizer.js:316: if (m =
/^(\w+)[^\'\"]*/.exec(next)) {
No $ required here or at 333?

http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/html-sanitizer.js#newcode325
src/com/google/caja/plugin/html-sanitizer.js:325: // slow case, need to
parse attributes
Don't need to parse attributes on an end tag.

http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/html-sanitizer.js#newcode464
src/com/google/caja/plugin/html-sanitizer.js:464: if (tag.eflags &
EFLAGS_TEXT) {
Ok, so the fact that this happens here but not in parseEndTag means that
you won't treat the x in </script>x</script> as a script body.

http://codereview.appspot.com/4559048/diff/11002/tests/com/google/caja/plugin/css-stylesheet-test.html
File tests/com/google/caja/plugin/css-stylesheet-test.html (left):

http://codereview.appspot.com/4559048/diff/11002/tests/com/google/caja/plugin/css-stylesheet-test.html#oldcode22
tests/com/google/caja/plugin/css-stylesheet-test.html:22: <script
type="text/javascript" src="../../../../js/jsunit/2.2/jsUnitCore.js"
my bad

http://codereview.appspot.com/4559048/

Reply via email to