Re: UTF-8 stress test file?

2004-10-12 Thread Philippe Verdy
From: Doug Ewell [EMAIL PROTECTED] Theodore H. Smith delete at elfdata dot com wrote: - the file mixes UTF-8 and UTF-16 Does this file mix UTF-8 and UTF-16? I thought it just had surrogates encoded into UTF-8? Of course a surrogate should never exist in UTF-8. You are right. Philippe's statement

Re: UTF-8 stress test file?

2004-10-12 Thread Clark Cox
On Tue, 12 Oct 2004 20:25:16 +0200, Philippe Verdy [EMAIL PROTECTED] wrote: From: Doug Ewell [EMAIL PROTECTED] Theodore H. Smith delete at elfdata dot com wrote: - the file mixes UTF-8 and UTF-16 Does this file mix UTF-8 and UTF-16? I thought it just had surrogates encoded into UTF-8?

Re: UTF-8 stress test file?

2004-10-12 Thread Philippe Verdy
From: Clark Cox [EMAIL PROTECTED] unless the file was used as a test for CESU-8 The whole point of the CESU-8-like section is that it is not legal UTF-8. Except that the document does not even cite CESU-8 but only UTF-16! The text itself is puzzling as well as nearly all its suggestions about
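
For readers following the surrogate discussion above, here is a minimal sketch of what "surrogates encoded into UTF-8" look like at the byte level. It assumes the usual definition: a code point in U+D800..U+DFFF pushed through the 3-byte UTF-8 pattern always yields the lead byte 0xED followed by a byte in 0xA0..0xBF. The function name `has_encoded_surrogate` is made up for this illustration and does not come from the thread or the test file.

```python
def has_encoded_surrogate(data: bytes) -> bool:
    """Return True if data contains a 3-byte sequence that would decode
    to a surrogate code point (U+D800..U+DFFF).

    Such sequences are what CESU-8 uses for supplementary characters,
    and they are not well-formed UTF-8.
    """
    for i in range(len(data) - 2):
        if (
            data[i] == 0xED                      # lead byte for U+D000..U+DFFF
            and 0xA0 <= data[i + 1] <= 0xBF      # second byte selects U+D800..U+DFFF
            and 0x80 <= data[i + 2] <= 0xBF      # ordinary continuation byte
        ):
            return True
    return False


# U+10000 written as a surrogate pair (D800 DC00), each half encoded separately.
cesu8_style = b"\xed\xa0\x80\xed\xb0\x80"
# The same character in well-formed UTF-8 is the 4-byte sequence F0 90 80 80.
proper_utf8 = "\U00010000".encode("utf-8")

print(has_encoded_surrogate(cesu8_style))   # True
print(has_encoded_surrogate(proper_utf8))   # False
```

Python's own strict decoder treats the first form as an error, which matches the point made in the thread that such input is not legal UTF-8.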

Re: UTF-8 stress test file?

2004-10-12 Thread Philipp Reichmuth
Philippe Verdy wrote: Examples of bad assumptions that a reader could make: - [quote](...) Experience so far suggests that most first-time authors of UTF-8 decoders find at least one serious problem in their decoder by using this file.[/quote] This suggests to the reader that if its browser or

Re: UTF-8 stress test file?

2004-10-12 Thread Philippe Verdy
From: Philipp Reichmuth [EMAIL PROTECTED] Don't you think you are stretching things a bit? This is a UTF-8 parser stress test file. If an application opens it in a different encoding, well, of course the results will be different, and things will not look UTF-8-ish. Again, this is a

Re: UTF-8 stress test file?

2004-10-11 Thread Philippe Verdy
From: Terje Bless [EMAIL PROTECTED] Theodore H. Smith [EMAIL PROTECTED] wrote: I'd like to see a UTF-8 stress test file. The top result on Google for the query UTF-8 Stress Test is http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt. This test

Re: UTF-8 stress test file?

2004-10-11 Thread Theodore H. Smith
Thanks Philippe, in that file, all UTF-8 sequences with 5 bytes or more are invalid (they are not boundary cases). Thanks. So the list of impossible bytes is longer than documented there. Is it just a case of moving the boundary cases into the impossible bytes? Or are there impossible bytes
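
As a rough illustration of the "impossible bytes" question above: the list below is not taken from the thread (the reply is cut off in this archive) but from the current definition of UTF-8, which stops at U+10FFFF and forbids overlong forms. Under that definition, 0xC0 and 0xC1 could only start overlong 2-byte forms, and 0xF5..0xFF could only start sequences beyond U+10FFFF or the old 5- and 6-byte forms. The names `IMPOSSIBLE_BYTES` and `find_impossible_bytes` are made up for this sketch.

```python
from typing import List

# Byte values that can never occur anywhere in well-formed UTF-8
# (assumption: UTF-8 as restricted to U+0000..U+10FFFF, no overlong forms).
IMPOSSIBLE_BYTES = frozenset([0xC0, 0xC1]) | frozenset(range(0xF5, 0x100))


def find_impossible_bytes(data: bytes) -> List[int]:
    """Return the offsets of all bytes that cannot appear in well-formed UTF-8."""
    return [offset for offset, byte in enumerate(data) if byte in IMPOSSIBLE_BYTES]


# A 5-byte sequence of the old style: the lead byte 0xF8 is flagged.
print(find_impossible_bytes(b"a\xf8\x88\x80\x80\x80"))
# Ordinary well-formed text contains none of these bytes.
print(find_impossible_bytes("é€𐍈".encode("utf-8")))
```

Note that this only catches bytes that are invalid in every position. Many malformed sequences (stray continuation bytes, truncated sequences, encoded surrogates) use bytes that are perfectly legal elsewhere, so a byte blacklist is a quick first filter, not a validator.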

Re: UTF-8 stress test file?

2004-10-10 Thread Terje Bless
Theodore H. Smith [EMAIL PROTECTED] wrote: I'd like to see a UTF-8 stress test file. The top result on Google for the query UTF-8 Stress Test is http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt. HTH, HAND. -link - -- I suggest you

Re: UTF-8 stress test file?

2004-10-10 Thread Simon Montagu
Theodore H. Smith wrote: I'd like to see a UTF-8 stress test file. It should consist of lines of UTF-8, each separated by a newline. Each line should be malformed. Also, some idea of how to deal with the malformed UTF-8 should be noted in a separate file. Really, I just want some way to verify

Re: UTF-8 stress test file?

2004-10-10 Thread Theodore H. Smith
I'd like to see a UTF-8 stress test file. It should consist of lines of UTF-8, each separated by a newline. Each line should be malformed. Also, some idea of how to deal with the malformed UTF-8 should be noted in a separate file. Really, I just want some way to verify that I can detect every
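
The kind of file requested above (one malformed sequence per line) could be driven by a small harness along these lines. This is only a sketch under stated assumptions: `undetected_lines` is a hypothetical name, the decoder under test is passed in as a callable that raises `ValueError` (or a subclass such as `UnicodeDecodeError`) on malformed input, and the four sample lines are illustrative, not the contents of any real test file.

```python
from typing import Callable, List


def undetected_lines(data: bytes, decode: Callable[[bytes], str]) -> List[int]:
    """Return the 1-based numbers of lines the decoder accepted.

    Every non-empty line is expected to be malformed, so any line that
    decodes without raising is a miss for the decoder under test.
    """
    missed = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        if not line:
            continue  # skip empty lines, e.g. after a trailing newline
        try:
            decode(line)
        except ValueError:  # UnicodeDecodeError is a subclass of ValueError
            continue
        missed.append(number)
    return missed


# Illustrative malformed lines: an overlong "/", an encoded surrogate,
# an old-style 5-byte sequence, and a lone continuation byte.
sample = b"\xc0\xaf\n\xed\xa0\x80\n\xf8\x88\x80\x80\x80\n\x80\n"

# Python's strict decoder rejects every line, so nothing is reported.
print(undetected_lines(sample, lambda b: b.decode("utf-8")))
# A decoder that accepts any byte misses all four lines.
print(undetected_lines(sample, lambda b: b.decode("latin-1")))
```

Splitting on the byte 0x0A is safe here because that value never occurs inside a multi-byte UTF-8 sequence, so a newline always ends whatever malformed sequence precedes it.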