On Mon, 18 Jul 2005 20:33:28 -0700, Yitzchak Scott-Thoennes
<[EMAIL PROTECTED]> wrote:

> On Sat, Jul 16, 2005 at 03:48:10PM +0200, H.Merijn Brand wrote:
> > On Sat, 16 Jul 2005 22:05:13 +0900, SADAHIRO Tomoyuki <[EMAIL PROTECTED]>
> > > utf8_to_uvchr((U8*)s, 0) used in do_chop() returns 0,
> > > not only if the octet sequence from *s is malformed,
> > > but also if *s == '\0'. The return value 0 should be
> > > for U+0000 (NUL) rather than malformedness.  Oops :-<
> > > 
> > > P.S. By the way, when a utf8 string ends with malformed
> > > octet(s), what should chop() do?
> > > It has returned undef without modifying the string.
> > 
> > Seems reasonable, though just cutting off one byte of the string would
> > maybe be more of an expected behaviour. Maybe
> 
> Was there more to that sentence?

No, I stopped after "maybe", because the more I thought about it, the less
certain I was about *any* opinion I might have. I decided to leave that to
the utf8 experts.

> I'd vote for removing and returning a malformed char, from the last
> non continuation byte on (or just the unexpected continuation bytes,
> if the problem was too many of them).
> 
> That way, the data error is propagated onto the return value (as IMO
> it should be), and a full-buffer problem will result in at most one
> bad char.  In fact, I could see being able to rely on this being
> advantageous to buffering code (both XS and perl):
> 
>    fill buffer with bytes
>    chop char and set aside
>    process buffer
>    move chopped char to start of buffer
>    repeat
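
For illustration only, the chop step of the buffering scheme quoted above
might be sketched in plain C roughly as follows. This is a sketch under
assumptions, not what do_chop() does: last_char_start() and
chop_last_char() are hypothetical names, not part of the perl API, and
the code covers only the "from the last non continuation byte on" case,
not the variant that removes just the surplus continuation bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/*
 * Hypothetical helper: offset at which the last character of
 * buf[0..len) starts, i.e. the last non-continuation byte.
 * If the buffer holds only continuation bytes, return 0 so that
 * the whole run is treated as one malformed character.
 */
static size_t last_char_start(const unsigned char *buf, size_t len)
{
    size_t i = len;

    while (i > 0) {
        i--;
        if (!is_continuation(buf[i]))
            return i;
    }
    return 0;
}

/*
 * Hypothetical helper: chop the last character, well-formed or not,
 * off buf and copy its bytes to tail. Shrinks *len by the number of
 * bytes chopped and returns that number. tail needs room for *len
 * bytes in the worst case.
 */
static size_t chop_last_char(unsigned char *buf, size_t *len,
                             unsigned char *tail)
{
    size_t start, n;

    assert(buf != NULL && len != NULL && tail != NULL);
    if (*len == 0)
        return 0;

    start = last_char_start(buf, *len);
    n = *len - start;
    memcpy(tail, buf + start, n);   /* set the chopped char aside */
    *len = start;                   /* buffer now ends on a char boundary */
    return n;
}
```

In a read loop, the caller would process buf[0..*len) after the chop and
then copy the bytes from tail to the start of the next buffer before
refilling, so a character split across two reads costs at most one
carried-over (possibly still incomplete) char.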

-- 
H.Merijn Brand        Amsterdam Perl Mongers (http://amsterdam.pm.org/)
using Perl 5.6.2, 5.8.0, 5.8.5, & 5.9.2  on HP-UX 10.20, 11.00 & 11.11,
 AIX 4.3 & 5.2, SuSE 9.2 & 9.3, and Cygwin. http://www.cmve.net/~merijn
Smoking perl: http://www.test-smoke.org,    perl QA: http://qa.perl.org
 reports  to: [EMAIL PROTECTED],                perl-qa@perl.org
