I have just updated my UTF-8 decoder stress test file

  UTF-8-test.txt

on

  http://www.cl.cam.ac.uk/~mgk25/ucs/examples/

(also part of <http://www.cl.cam.ac.uk/~mgk25/download/ucs-fonts.tar.gz>).

It now contains an additional section 5 with UTF-8 sequences for illegal
code positions (surrogates, U+FFFE, U+FFFF) that a good decoder should
reject for security reasons, just like overlong and malformed sequences,
as well as all the relevant legal boundary conditions for these.
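
As a rough illustration of the kind of checks meant here, the following
Python sketch decodes a single UTF-8 sequence and rejects the cases listed
above. It is not taken from the test file: the function name decode_one
and the sample byte sequences are assumptions made for this example,
computed from the UTF-8 encoding rules.

```python
def decode_one(buf: bytes) -> int:
    """Decode exactly one UTF-8 sequence and return its code point.

    Raises ValueError for malformed or overlong sequences, surrogates,
    U+FFFE, U+FFFF, and values above U+10FFFF.
    (Hypothetical helper for illustration only.)
    """
    if not buf:
        raise ValueError("empty input")
    b0 = buf[0]
    # The lead byte determines the sequence length and the smallest code
    # point that legitimately needs that length.
    if b0 < 0x80:
        length, cp, min_cp = 1, b0, 0x0
    elif 0xC0 <= b0 < 0xE0:
        length, cp, min_cp = 2, b0 & 0x1F, 0x80
    elif 0xE0 <= b0 < 0xF0:
        length, cp, min_cp = 3, b0 & 0x0F, 0x800
    elif 0xF0 <= b0 < 0xF8:
        length, cp, min_cp = 4, b0 & 0x07, 0x10000
    else:
        # Stray continuation byte (0x80..0xBF) or invalid lead (0xF8..0xFF).
        raise ValueError("malformed lead byte")
    if len(buf) != length:
        raise ValueError("wrong sequence length")
    for b in buf[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("malformed continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    if cp < min_cp:
        raise ValueError("overlong sequence")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code position")
    if cp in (0xFFFE, 0xFFFF):
        raise ValueError("illegal code position U+FFFE/U+FFFF")
    if cp > 0x10FFFF:
        raise ValueError("code point beyond U+10FFFF")
    return cp


def is_rejected(buf: bytes) -> bool:
    """Return True if decode_one refuses the sequence."""
    try:
        decode_one(buf)
    except ValueError:
        return True
    return False
```

For example, is_rejected(b"\xed\xa0\x80") (U+D800) and
is_rejected(b"\xc0\xaf") (an overlong "/") should both be true, while the
neighbouring legal boundaries b"\xed\x9f\xbf" (U+D7FF) and b"\xee\x80\x80"
(U+E000) should decode normally.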

I hope you'll find it useful for ironing out the last hidden bugs in
your UTF-8 decoders. Feel free to adapt the file and include it in
your UTF-8 regression test suites.

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/
