I have just updated my UTF-8 decoder stress test file UTF-8-test.txt on
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/
(also part of <http://www.cl.cam.ac.uk/~mgk25/download/ucs-fonts.tar.gz>).
It now contains an additional section 5 with UTF-8 sequences for illegal
code positions (surrogates, U+FFFE, U+FFFF). For security reasons, a good
decoder should reject these just as it rejects overlong and malformed
sequences. The section also covers all the relevant legal boundary
conditions around these code positions.
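
To illustrate the kind of check section 5 is meant to exercise, here is a
minimal Python sketch of a decoder that rejects surrogates, U+FFFE, U+FFFF
and overlong forms. It is not taken from the test file: the name
decode_strict and the error messages are my own, and I assume the 4-byte
form of UTF-8 with an upper limit of U+10FFFF, so longer sequences are
treated as invalid lead bytes.

```python
def decode_strict(data: bytes) -> str:
    """Decode UTF-8, raising ValueError on illegal code positions.

    Hypothetical helper for illustration only. Rejects surrogates,
    U+FFFE, U+FFFF, overlong forms and malformed sequences.
    """
    out = []
    i = 0
    n = len(data)
    while i < n:
        b0 = data[i]
        # Sequence length, payload bits of the lead byte, and the smallest
        # code point that legitimately needs this length.
        if b0 < 0x80:
            length, cp, minimum = 1, b0, 0x0
        elif 0xC0 <= b0 < 0xE0:
            length, cp, minimum = 2, b0 & 0x1F, 0x80
        elif 0xE0 <= b0 < 0xF0:
            length, cp, minimum = 3, b0 & 0x0F, 0x800
        elif 0xF0 <= b0 < 0xF8:
            length, cp, minimum = 4, b0 & 0x07, 0x10000
        else:
            # Stray continuation byte (0x80..0xBF) or 0xF8..0xFF.
            raise ValueError(f"invalid lead byte 0x{b0:02X} at offset {i}")
        if i + length > n:
            raise ValueError(f"truncated sequence at offset {i}")
        for b in data[i + 1:i + length]:
            if b & 0xC0 != 0x80:
                raise ValueError(f"bad continuation byte at offset {i}")
            cp = (cp << 6) | (b & 0x3F)
        if cp < minimum:
            raise ValueError(f"overlong sequence at offset {i}")
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"surrogate U+{cp:04X} at offset {i}")
        if cp in (0xFFFE, 0xFFFF):
            raise ValueError(f"illegal code position U+{cp:04X} at offset {i}")
        if cp > 0x10FFFF:
            raise ValueError(f"code point out of range at offset {i}")
        out.append(chr(cp))
        i += length
    return "".join(out)
```

With this sketch, the legal neighbours ED 9F BF (U+D7FF), EE 80 80 (U+E000)
and EF BF BD (U+FFFD) decode normally, while ED A0 80 (U+D800), EF BF BE
(U+FFFE), EF BF BF (U+FFFF) and the overlong slash C0 AF all raise
ValueError. Note that Python's built-in "utf-8" codec already rejects
surrogates and overlong forms but accepts U+FFFE and U+FFFF, so the extra
check is only needed if you want the stricter policy described above.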
I hope you'll find it useful for ironing out the last hidden bugs in
your UTF-8 decoders. Feel free to adapt the file and include it in
your UTF-8 regression test suites.
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/