Re: RegExp syntax suggestion: allow CharacterClassEscape in CharacterRange

Lasse Reichstein Thu, 09 Dec 2010 12:34:01 -0800

On Wed, 08 Dec 2010 21:43:06 +0100, Gavin Barraclough <barraclo...@apple.com> wrote:

According to the ES5 spec a regular expression such as /[\w-_]/ should generate a syntax error. Unfortunately there appears to be a significant quantity of existing code that will break if this behavior is implemented (I have been experimenting with bringing WebKit's RegExp implementation into closer conformance to the spec), and looking at other implementations it appears common for this error to be ignored.


It's far from the only extension to RegExp syntax that is common to most
implementations. In fact, the extensions are both extensive and consistent
across browsers. A quick check through the possible syntax errors shows
the following:

// Invalid ControlEscape/IdentityEscape character treated as literal.
  /\z/;  // Invalid escape, same as /z/
// Incomplete/Invalid ControlEscape treated as either "\\c" or "c"
  /\c/;  // same as /c/ or /\\c/
  /\c2/;  // same as /c2/ or /\\c2/
// Incomplete HexEscapeSequence escape treated as either "\\x" or "x".
  /\x/;  // incomplete x-escape
  /\x1/;  // incomplete x-escape
  /\x1z/;  // incomplete x-escape
// Incomplete UnicodeEscapeSequence escape treated as either "\\u" or "u".
  /\u/;  // incomplete u-escape
  /\uz/;  // incomplete u-escape
  /\u1/;  // incomplete u-escape
  /\u1z/;  // incomplete u-escape
  /\u12/;  // incomplete u-escape
  /\u12z/;  // incomplete u-escape
  /\u123/;  // incomplete u-escape
  /\u123z/;  // incomplete u-escape
// Bad quantifier range:
  /x{z/;  // same as /x\{z/
  /x{1z/;  // same as /x\{1z/
  /x{1,z/;  // same as /x\{1,z/
  /x{1,2z/;  // same as /x\{1,2z/
  /x{10000,20000z/;  // same as /x\{10000,20000z/
// Notice: It needs arbitrary lookahead to determine the invalidity,
// except Mozilla that limits the numbers.

// Zero-initialized Octal escapes.
  /\012/;    // same as /\x0a/

// Nonexisting back-references treated as octal escapes:
  /\5/;  // same as /\x05/

// Invalid PatternCharacter accepted unescaped
  /]/;
  /{/;
  /}/;

// Bad escapes also inside CharacterClass.
  /[\z]/;
  /[\c]/;
  /[\c2]/;
  /[\x]/;
  /[\x1]/;
  /[\x1z]/;
  /[\u]/;
  /[\uz]/;
  /[\u1]/;
  /[\u1z]/;
  /[\u12]/;
  /[\u12z]/;
  /[\u123]/;
  /[\u123z]/;
  /[\012]/;
  /[\5]/;
// And in addition:
  /[\B]/;
  /()()[\2]/;  // Refers to an existing group, but should be invalid inside a class.

None of these RegExps cause a syntax error in any of the current "top-5"
browsers, even though they are (AFAICS) invalid syntax.
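
One way to repeat this check is sketched below. The helper name compiles
and the choice of samples are mine, not part of the original test, and
the result depends on the engine the script runs in.

```javascript
// Hypothetical helper: true if the engine accepts the pattern source.
function compiles(source, flags = "") {
  try {
    new RegExp(source, flags);
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e; // anything else is unexpected
  }
}

// A sample of the patterns listed above, written as pattern sources.
// String.raw keeps the backslashes intact.
const samples = [
  String.raw`\z`, String.raw`\c`, String.raw`\c2`,
  String.raw`\x1z`, String.raw`\u123z`,
  String.raw`x{1,2z`, String.raw`\012`, String.raw`\5`,
  "]", "{", "}",
  String.raw`[\B]`, String.raw`()()[\2]`,
];

// Collect the sources this engine refuses to compile.
const rejected = samples.filter((s) => !compiles(s));
console.log(rejected.length === 0 ? "all accepted" : rejected);
```

Using the RegExp constructor rather than literals matters here: a literal
the engine rejects would stop the whole script from parsing.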

Most of the RegExp implementations treat a malformed (start of a
multi-character) escape sequence as a simple identity escape or octal
escape, and extend identity escapes to all characters that don't already
have another meaning (ControlEscape, CharacterClassEscape, or one of c, x,
u, or b, and B outside a CharacterClass).

To match the current behavior, IdentityEscape shouldn't exclude all of
IdentifierPart, but only the characters that already mean something else.

Allowing /\c2/ to match "c2", but requiring /\cB/ to match "\x02" seems
like it would be better explained in prose than in the BNF.
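
The contrast can be seen with a small sketch like the following. The
variable names are mine, and the malformed case assumes an engine that
accepts the pattern at all; whether it then reads it as /c2/ or /\\c2/
varies, as noted in the list above.

```javascript
// \c followed by a letter is a control escape: \cB is U+0002.
const controlB = /\cB/;
console.log(controlB.test("\x02"));
console.log(controlB.test("B"));

// \c followed by a digit is not a control escape. The constructor is
// used so that a strict engine throws at run time, not at parse time.
const malformed = new RegExp(String.raw`\c2`);

// The subject string is backslash, "c", "2". It contains "c2", so it
// matches under either reading of the malformed pattern.
console.log(malformed.test("\\c2"));

// This one tells the two readings apart: true only for the /c2/ reading.
console.log(malformed.test("c2"));
```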

...

I'd like to propose a minimal change to hopefully allow implementations to come into line with the spec, without breaking the web. I'd suggest changing the first step of CharacterRange to instead read:
1. If A does not contain exactly one character or B does not contain exactly one character then create a CharSet AB containing the union of the CharSets A and B, and return the union of CharSet AB and the CharSet containing the one character -.

I think this matches the current actual behavior of all the browsers, and
is short and understandable.
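
As a rough illustration of the proposed step, the sketch below probes
/[\w-_]/ against a few single characters. It assumes an engine that
accepts the pattern, as the browsers discussed here do; the variable
names are mine.

```javascript
// Built with the constructor so a strict engine would throw at run time
// instead of refusing to parse the script.
const re = new RegExp(String.raw`[\w-_]`);

// Under the proposed rule the class is the union of \w, "_" and "-",
// so word characters and the hyphen match while "+" does not.
const probes = ["a", "7", "_", "-", "+"];
const results = probes.map((ch) => [ch, re.test(ch)]);

for (const [ch, matched] of results) {
  console.log(JSON.stringify(ch), matched);
}
```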

/Lasse R.H. Nielsen

_______________________________________________
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss
