updated snapshot
http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/html-sanitizer-exp.js
File src/com/google/caja/plugin/html-sanitizer-exp.js (right):
http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/html-sanitizer-exp.js#newcode6
src/com/google/caja/plugin/html-sanitizer-exp.js:6: */
On 2012/03/20 17:26:36, MikeSamuel wrote:
What do you mean by "this file defines"? Are html objects defined in
other
files, like html-sanitizer-legacy, observably different?
html-sanitizer-test.html says
<script type="text/javascript" src="html-sanitizer-legacy.js"></script>
<script>var html0 = html; html = void 0;</script>
basically imitation module import.
clarified the comment to: "If this file sets the global 'html' to a
value similar to that in html-sanitizer.js"
http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/html-sanitizer.js
File src/com/google/caja/plugin/html-sanitizer.js (right):
http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/html-sanitizer.js#newcode217
src/com/google/caja/plugin/html-sanitizer.js:217: '(\')[^\']*(\'|$)' +
// 6, 7 = Single-quoted string
On 2012/03/20 17:26:36, MikeSamuel wrote:
Do we get any benefit from doing ([\"\'])[\s\S]*?(\4|$) and avoid
having two
sets of quote groups?
I'm wary of backreferences because some regexp engines don't handle them
well. I'll add a TODO to look into this.
http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/html-sanitizer.js#newcode221
src/com/google/caja/plugin/html-sanitizer.js:221:
'(?=[a-z][a-z-]*\\s*=)' +
On 2012/03/20 17:26:36, MikeSamuel wrote:
Not necessarily relevant to this CL, but if it's easier to drop this
case, I
think we can. I think HTML5 actually settled on treating
<a a= b=c>
as equivalent to
<a a= "b=c">
http://www.w3.org/TR/html5/tokenization.html#before-attribute-value-state
and
http://www.w3.org/TR/html5/tokenization.html#attribute-value-unquoted-state
indicate that '=' inside an unquoted value is an error state, but the
procedure
to follow when you don't fail fast is to treat the '=' and following
content as
attribute value content.
ok, adding a TODO
http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/html-sanitizer.js#newcode230
src/com/google/caja/plugin/html-sanitizer.js:230: var ENTITY_RE =
/^(#[0-9]+|#[x][0-9a-f]+|\w+);/i;
On 2012/03/20 17:26:36, MikeSamuel wrote:
[x] -> x
Done.
http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/html-sanitizer.js#newcode233
src/com/google/caja/plugin/html-sanitizer.js:233: var splitWillCapture =
('a,b'.split(/(,)/).length === 3);
On 2012/03/20 17:26:36, MikeSamuel wrote:
Ok. Does it matter whether ','.split(/,/).length == 2 or 0?
no, I don't care whether the result has null strings or not, I just care
whether groups capture or not. null strings get ignored in the switch
statement in parse()
http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/html-sanitizer.js#newcode316
src/com/google/caja/plugin/html-sanitizer.js:316: if (m =
/^(\w+)[^\'\"]*/.exec(next)) {
On 2012/03/20 17:26:36, MikeSamuel wrote:
No $ required here or at 333?
There are three cases that need to be handed after we see a '<' or a
'</'
1. we don't have a valid tag name -> emit pcdata
2. we do have a valid tag name, and there are no quotes -> fast handling
of simple tags
3. we do have a valid tag name, and there are quotes -> slow handling of
tags with attributes.
The non-anchored regexp lets me distinguish the three cases with just
one regexp test. An anchored regexp would require adding a second
regexp test.
http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/html-sanitizer.js#newcode325
src/com/google/caja/plugin/html-sanitizer.js:325: // slow case, need to
parse attributes
On 2012/03/20 17:26:36, MikeSamuel wrote:
Don't need to parse attributes on an end tag.
It's possible for someone to write </p foo="a>">, and if we don't parse
end-tag attributes here, the result would be sanitized differently.
This might be a case that we don't care about, I'll add a TODO.
http://codereview.appspot.com/4559048/