[racket] Regex's and utf-8

Harry Spier Fri, 27 Jul 2012 09:32:38 -0700

Would it be possible (or would it be a good idea) for character
regexes to have a mode option, "strict" or "not-strict", that in
strict mode would throw an error if the input contained bytes that
are not valid UTF-8?


One possible use is this.
It's really easy to accidentally apply a character regex to a byte string
(when you meant to apply a byte regex to a byte string), run
test cases, and think it's working OK.
I.e. to write:
(regexp-match-positions* #rx"[^ÿ]+" #"...input byte string...")
when you meant
(regexp-match-positions* #rx#"[^ÿ]+" #"...input byte string...")

For example, this appears to work:
> (integer->char 255)
#\ÿ
> (regexp-match-positions* #rx"[^ÿ]+" #"abcÿabc")
'((0 . 3) (4 . 7))

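BUT unrelated patterns give the same result, because byte 255 is not
valid UTF-8 and so a character regex can never match it: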
> (regexp-match-positions* #rx"[^k]+" #"abcÿabc")
'((0 . 3) (4 . 7))
> (regexp-match-positions* #rx".+" #"abcÿabc")
'((0 . 3) (4 . 7))
A "strict" mode would expose this error instead of silently splitting
the match at the invalid byte.
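
A minimal sketch of how such a check could be approximated in user code,
assuming a wrapper is acceptable. The name regexp-match-positions*/strict
is hypothetical and not part of Racket; it relies only on regexp?,
bytes? and bytes-utf-8-length, the last of which returns #f when the
byte string is not a valid UTF-8 encoding.

```racket
#lang racket/base

;; Hypothetical helper, not a Racket built-in: behaves like
;; regexp-match-positions*, but refuses to match a character regexp
;; against a byte string that is not valid UTF-8.
(define (regexp-match-positions*/strict pattern input)
  (when (and (regexp? pattern)                  ; character regexp (#rx"..." or #px"...")
             (bytes? input)                     ; applied to a byte string
             (not (bytes-utf-8-length input)))  ; #f means the bytes are not valid UTF-8
    (raise-argument-error 'regexp-match-positions*/strict
                          "byte string that is valid UTF-8"
                          input))
  ;; Byte regexps and valid inputs go through unchanged.
  (regexp-match-positions* pattern input))
```

With this wrapper, matching #rx".+" against (bytes 97 98 99 255 97 98 99)
raises a contract error rather than returning two partial matches, while
the byte regex #rx#".+" on the same input is passed through unchanged.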

Thanks,
Harry Spier

____________________
  Racket Users list:
  http://lists.racket-lang.org/users
