On Sat, Jul 16, 2005 at 03:48:10PM +0200, H.Merijn Brand wrote:
> On Sat, 16 Jul 2005 22:05:13 +0900, SADAHIRO Tomoyuki <[EMAIL PROTECTED]>
> wrote:
>
> > > This is a bug report for perl from [EMAIL PROTECTED],
> > > generated with the help of perlbug 1.35 running under perl v5.8.4.
> > >
> > > I ran into this, and wondered if it is a bug.
> > >
> > > I have tested on perl 5.8.4 with Encode.pm version 1.99_01 (from
> > > Debian package) and 2.10 (from CPAN).
> >
> > Thanks for the report.
>
> Thanks for the fast patch. Applied as change #25158
>
> > utf8_to_uvchr((U8*)s, 0) used in do_chop() returns 0,
> > not only if the octet sequence from *s is malformed,
> > but also if *s == '\0'. The return value 0 should be
> > for U+0000 (NUL) rather than malformedness. Oops :-<
> >
> > P.S. by the way, when the string in utf8 ends with malformed
> > octet(s), how should chop() do?
> > It has returned undef without modification of the string.
>
> Seems reasonable, though just cutting off one byte of the string would maybe
> more of an expected behaviour. Maybe
Was there more to that sentence?

I'd vote for removing and returning a malformed char, from the last
non-continuation byte on (or just the unexpected continuation bytes, if the
problem was too many of them). That way, the data error is propagated onto
the return value (as IMO it should be), and a full-buffer problem will result
in at most one bad char.

In fact, I could see being able to rely on this being advantageous to
buffering code (both XS and perl):

    fill buffer with bytes
    chop char and set aside
    process buffer
    move chopped char to start of buffer
    repeat
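
To make the proposal concrete, here is a rough C sketch of the "chop char and set aside" step. It is not what do_chop() does today, and it does not use the perl API at all: last_char_start() and chop_last_char() are hypothetical names made up for illustration. It assumes "continuation byte" means any octet of the form 10xxxxxx, and it takes only the simple reading of the rule (cut from the last non-continuation byte on), not the "too many continuation bytes" refinement.

```c
#include <stddef.h>
#include <string.h>

/* UTF-8 continuation bytes have the bit pattern 10xxxxxx. */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: offset of the last non-continuation byte in
 * buf[0..len). Returns 0 if the buffer is empty or holds nothing but
 * continuation bytes. */
size_t last_char_start(const unsigned char *buf, size_t len)
{
    size_t i = len;
    while (i > 0) {
        i--;
        if (!is_continuation(buf[i]))
            return i;
    }
    return 0;
}

/* Hypothetical chop: copy the trailing char (complete or truncated) into
 * out, shorten *len to exclude it, and return the number of bytes removed.
 * out must have room for up to *len bytes, since a buffer made up only of
 * continuation bytes is removed whole. */
size_t chop_last_char(const unsigned char *buf, size_t *len,
                      unsigned char *out)
{
    size_t start = last_char_start(buf, *len);
    size_t n = *len - start;

    memcpy(out, buf + start, n);
    *len = start;
    return n;
}
```

A reader loop would call chop_last_char() after each fill, process the first *len bytes, then memmove() the set-aside bytes to the front of the buffer before the next fill. Note that a trailing NUL octet comes back as a one-byte char like any other, so it is not confused with a malformed sequence.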