On 1/16/2012 9:52 PM, Bjoern Hoehrmann wrote:
> * Christopher J. Madsen wrote:
>>> Your UTF-8 validation code seems wrong to me, you consider the sequence
>>> F0 80 to be incomplete, but it's actually invalid, same for ED 80, see
>>> the chart in <http://bjoern.hoehrmann.de/utf-8/decoder/dfa/#design>.
>>
>> I guess the RE could be improved, but I'm not sure it's worth the effort
>> and added complication to catch a tiny fraction of false positives.
> 
> Why make the check at all if you don't care if it's right?

I can't use a simple utf8::decode check, because I read a fixed number
of bytes, and that might have cut a multi-byte character in half.  So I
use Encode::FB_QUIET, and then check the leftovers to make sure they
form a single, plausible, partial UTF-8 character.  I have to check the
leftovers, or the whole test would be meaningless.  I just make sure
it's a start byte followed by an appropriate number of continuation bytes.
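The leftover check described above can be sketched roughly as follows (the original is a Perl RE in HTTP::Message; this is an illustrative Python translation, and the names and the exact pattern are mine, not the module's):

```python
import re

# Lenient pattern for a *partial* UTF-8 character at the end of a chunk:
# a start byte followed by fewer continuation bytes than it calls for.
# Deliberately lenient, as in the discussion above: it accepts sequences
# like F0 80 that a strict decoder would reject as invalid.
PARTIAL_UTF8 = re.compile(
    rb"""\A(?:
        [\xC2-\xDF]                      # 2-byte start, 0 of 1 continuations
      | [\xE0-\xEF][\x80-\xBF]{0,1}      # 3-byte start, 0-1 of 2
      | [\xF0-\xF4][\x80-\xBF]{0,2}      # 4-byte start, 0-2 of 3
    )\Z""",
    re.VERBOSE,
)

def looks_like_utf8(chunk: bytes) -> bool:
    """Decode as much as possible; accept only if the leftover bytes
    form at most one plausible partial multi-byte character."""
    try:
        chunk.decode("utf-8")
        return True                      # decoded cleanly, no leftovers
    except UnicodeDecodeError as e:
        # Acceptable only if the trouble starts at the tail and the tail
        # looks like a truncated multi-byte character.
        leftover = chunk[e.start:]
        return bool(PARTIAL_UTF8.match(leftover))
```

An error in the middle of the chunk leaves trailing bytes after the bad sequence, so the anchored pattern fails and the chunk is rejected; only a truncation at the very end passes.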

As you say, certain start bytes can't validly be followed by certain
continuation bytes, but writing an RE for those rules is more complexity
than I think the problem warrants.  What are the odds that I had 1021
bytes of valid UTF-8 (including at least 1 multi-byte character)
followed by bytes that match my current RE but a strict test could have
rejected?  I'm already just assuming that the next bytes would be
additional continuation bytes.
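For comparison, the stricter pattern dismissed above as not worth the complexity could look something like this (a hypothetical Python sketch; the per-start-byte second-byte ranges follow the UTF-8 definition in RFC 3629):

```python
import re

# Strict partial-sequence pattern: encodes the second-byte restrictions
# from RFC 3629, so overlong and surrogate prefixes like F0 80 or ED A0
# are rejected even as "partial" characters.
STRICT_PARTIAL = re.compile(
    rb"""\A(?:
        [\xC2-\xDF]                          # 2-byte start
      | \xE0(?:[\xA0-\xBF])?                 # E0 needs A0-BF next (no overlongs)
      | [\xE1-\xEC\xEE\xEF](?:[\x80-\xBF])?  # ordinary 3-byte starts
      | \xED(?:[\x80-\x9F])?                 # ED excludes surrogate prefixes
      | \xF0(?:[\x90-\xBF][\x80-\xBF]?)?     # F0 needs 90-BF next
      | [\xF1-\xF3](?:[\x80-\xBF][\x80-\xBF]?)?
      | \xF4(?:[\x80-\x8F][\x80-\xBF]?)?     # F4 capped at U+10FFFF
    )\Z""",
    re.VERBOSE,
)
```

It is only a handful of extra alternations, but as argued above, the false positives it would catch require a very specific byte pattern right at the chunk boundary.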

>>> Anyway, if people think this is the way to go, maybe HTTP::Message can
>>> adopt the Content-Type header charset extraction tests in HTML::Encoding
>>> so they don't get lost as my module becomes redundant?
>>
>> I thought it already did that?
> 
> Not as far as I can tell; links welcome though.

At the beginning of content_charset, it calls content_type_charset
(which is actually a HTTP::Headers method).

Or were you talking about t/01http.t and its associated input files?

-- 
Chris Madsen                                          p...@cjmweb.net
  --------------------  http://www.cjmweb.net  --------------------
