On Mon, 18 Jul 2005 20:33:28 -0700, Yitzchak Scott-Thoennes <[EMAIL PROTECTED]> wrote:
> On Sat, Jul 16, 2005 at 03:48:10PM +0200, H.Merijn Brand wrote:
> > On Sat, 16 Jul 2005 22:05:13 +0900, SADAHIRO Tomoyuki <[EMAIL PROTECTED]>
> > > utf8_to_uvchr((U8*)s, 0) used in do_chop() returns 0,
> > > not only if the octet sequence from *s is malformed,
> > > but also if *s == '\0'. The return value 0 should be
> > > for U+0000 (NUL) rather than malformedness. Oops :-<
> > >
> > > P.S. by the way, when the string in utf8 ends with malformed
> > > octet(s), how should chop() do?
> > > It has returned undef without modification of the string.
> >
> > Seems reasonable, though just cutting off one byte of the string would
> > maybe more of an expected behaviour. Maybe
>
> Was there more to that sentence?

No, I stopped after maybe. Because the more I thought about it, the less
certain I was about *any* opinion I might have. I decided to leave that
to the utf8 experts

> I'd vote for removing and returning a malformed char, from the last
> non continuation byte on (or just the unexpected continuation bytes,
> if the problem was too many of them).
>
> That way, the data error is propagated onto the return value (as IMO
> it should be), and a full-buffer problem will result in at most one
> bad char. In fact, I could see being able to rely on this being
> advantageous to buffering code (both XS and perl):
>
>     fill buffer with bytes
>     chop char and set aside
>     process buffer
>     move choped char to start of buffer
>     repeat

-- 
H.Merijn Brand        Amsterdam Perl Mongers (http://amsterdam.pm.org/)
using Perl 5.6.2, 5.8.0, 5.8.5, & 5.9.2 on HP-UX 10.20, 11.00 & 11.11,
AIX 4.3 & 5.2, SuSE 9.2 & 9.3, and Cygwin. http://www.cmve.net/~merijn
Smoking perl: http://www.test-smoke.org, perl QA: http://qa.perl.org
reports to: [EMAIL PROTECTED], perl-qa@perl.org
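
For reference, the buffering loop quoted above (fill, chop, process, move,
repeat) could look roughly like this in plain C. This is only a sketch of
the proposed behaviour, not what do_chop() does today: last_char_start()
and process_stream() are made-up names, "process" is just a copy, BUF_SIZE
is arbitrary, and the overflow guard for long malformed runs is my own
assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUF_SIZE 64  /* arbitrary size for this sketch */

/* True for UTF-8 continuation bytes (bit pattern 10xxxxxx). */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/*
 * Hypothetical helper, not part of the perl API: offset of the last
 * non-continuation byte in buf[0..len). Everything from that offset on
 * is the "chopped" character, whether or not it is complete.
 * Returns 0 if the buffer holds only continuation bytes.
 */
static size_t last_char_start(const unsigned char *buf, size_t len)
{
    size_t i = len;
    while (i > 0) {
        i--;
        if (!is_continuation(buf[i]))
            return i;
    }
    return 0;
}

/*
 * Run the loop over `in`, taking at most `chunk` bytes per round, and
 * append each processed span to `out` (which must hold in_len bytes).
 * Returns the number of bytes written.
 */
static size_t process_stream(const unsigned char *in, size_t in_len,
                             size_t chunk, unsigned char *out)
{
    unsigned char buf[BUF_SIZE];
    size_t held = 0;     /* chopped bytes kept at the start of buf */
    size_t pos = 0;      /* read position in the input */
    size_t out_len = 0;

    assert(chunk > 0 && chunk <= BUF_SIZE / 2);

    while (pos < in_len) {
        /* Assumption: a held run that leaves no room for a full chunk
         * is malformed; pass it through rather than overflow buf. */
        if (held > BUF_SIZE - chunk) {
            memcpy(out + out_len, buf, held);
            out_len += held;
            held = 0;
        }

        /* fill buffer with bytes */
        size_t n = in_len - pos < chunk ? in_len - pos : chunk;
        memcpy(buf + held, in + pos, n);
        pos += n;
        size_t len = held + n;

        /* chop char and set aside */
        size_t cut = last_char_start(buf, len);

        /* process buffer: here the prefix is simply copied out */
        memcpy(out + out_len, buf, cut);
        out_len += cut;

        /* move chopped char to start of buffer */
        held = len - cut;
        memmove(buf, buf + cut, held);
    }

    /* end of input: whatever is still held is the final character */
    memcpy(out + out_len, buf, held);
    return out_len + held;
}
```

The point of the sketch is that each processed span ends on a character
boundary no matter where the chunk boundary falls, because the possibly
incomplete last character is always carried into the next round.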