updated snapshot

http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/html-sanitizer-exp.js
File src/com/google/caja/plugin/html-sanitizer-exp.js (right):

http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/html-sanitizer-exp.js#newcode6
src/com/google/caja/plugin/html-sanitizer-exp.js:6: */
On 2012/03/20 17:26:36, MikeSamuel wrote:
What do you mean by "this file defines"?  Are html objects defined in
other
files, like html-sanitizer-legacy, observably different?

html-sanitizer-test.html says

<script type="text/javascript" src="html-sanitizer-legacy.js"></script>
<script>var html0 = html; html = void 0;</script>

basically imitation module import.

clarified the comment to: "If this file sets the global 'html' to a
value similar to that in html-sanitizer.js"

http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/html-sanitizer.js
File src/com/google/caja/plugin/html-sanitizer.js (right):

http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/html-sanitizer.js#newcode217
src/com/google/caja/plugin/html-sanitizer.js:217: '(\')[^\']*(\'|$)' +
 // 6, 7 = Single-quoted string
On 2012/03/20 17:26:36, MikeSamuel wrote:
Do we get any benefit from doing ([\"\'])[\s\S]*?(\4|$) and avoid
having two
sets of quote groups?

I'm wary of backreferences because some regexp engines don't handle them
well.  I'll add a TODO to look into this.

http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/html-sanitizer.js#newcode221
src/com/google/caja/plugin/html-sanitizer.js:221:
'(?=[a-z][a-z-]*\\s*=)' +
On 2012/03/20 17:26:36, MikeSamuel wrote:
Not necessarily relevant to this CL, but if it's easier to drop this
case, I
think we can.  I think HTML5 actually settled on treating

    <a a= b=c>

as equivalent to

     <a a= "b=c">


http://www.w3.org/TR/html5/tokenization.html#before-attribute-value-state
and

http://www.w3.org/TR/html5/tokenization.html#attribute-value-unquoted-state
indicate that '=' inside an unquoted value is an error state, but the
procedure
to follow when you don't fail fast is to treat the '=' and following
content as
attribute value content.

ok, adding a TODO

http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/html-sanitizer.js#newcode230
src/com/google/caja/plugin/html-sanitizer.js:230: var ENTITY_RE =
/^(#[0-9]+|#[x][0-9a-f]+|\w+);/i;
On 2012/03/20 17:26:36, MikeSamuel wrote:
[x] -> x

Done.

http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/html-sanitizer.js#newcode233
src/com/google/caja/plugin/html-sanitizer.js:233: var splitWillCapture =
('a,b'.split(/(,)/).length === 3);
On 2012/03/20 17:26:36, MikeSamuel wrote:
Ok.  Does it matter whether ','.split(/,/).length == 2 or 0?

no, I don't care whether the result has null strings or not, I just care
whether groups capture or not.  null strings get ignored in the switch
statement in parse()

http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/html-sanitizer.js#newcode316
src/com/google/caja/plugin/html-sanitizer.js:316: if (m =
/^(\w+)[^\'\"]*/.exec(next)) {
On 2012/03/20 17:26:36, MikeSamuel wrote:
No $ required here or at 333?

There are three cases that need to be handed after we see a '<' or a
'</'
1. we don't have a valid tag name -> emit pcdata
2. we do have a valid tag name, and there are no quotes -> fast handling
of simple tags
3. we do have a valid tag name, and there are quotes -> slow handling of
tags with attributes.

The non-anchored regexp lets me distinguish the three cases with just
one regexp test.  An anchored regexp would require adding a second
regexp test.

http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/html-sanitizer.js#newcode325
src/com/google/caja/plugin/html-sanitizer.js:325: // slow case, need to
parse attributes
On 2012/03/20 17:26:36, MikeSamuel wrote:
Don't need to parse attributes on an end tag.

It's possible for someone to write </p foo="a>">, and if we don't parse
end-tag attributes here, the result would be sanitized differently.
This might be a case that we don't care about, I'll add a TODO.

http://codereview.appspot.com/4559048/

Reply via email to