Re: Don't use the \C escape in regexes - Why not?

Aristotle Pagaltzis Tue, 04 May 2010 10:46:17 -0700

* Michael Ludwig <michael.lud...@xing.com> [2010-05-04 14:55]:
> But wait a second: While URIs are meant to be made of
> characters, they're also meant to go over the wire, and there
> are no characters on the wire, only bytes. There is no standard
> encoding defined for the wire, although UTF-8 has come to be
> seen as the standard encoding for URIs containing non-ASCII
> characters. Perl having two standard encodings (UTF-8 and
> ISO-8859-1) for text and relying on the internal flag to tell
> which one is meant to matter, shouldn't the URI module either
> only accept bytes or only characters? Or rather, provide two
> different constructors instead of only one trying to be
> intelligent?
>
>  URI->bytes( $bytes ); # byte string
>  URI->chars( $chars ); # character string
>
> And, in addition, define the character encoding used for
> serialization.


Yes, exactly. And both methods would use the moral equivalent of
a plain `split //` – no trickery such as with `\C`. The only
difference between them is that the `chars` method would
`encode_utf8` the string first and then percent-encode it blindly,
whereas the `bytes` method would leave it as is but then croak if
it found a codepoint > 0xFF (since the string is supposed to
represent an octet sequence already).

Notably absent in both cases: any dependence on the state of the
UTF8 flag of the string.
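
To make that concrete, here is a rough sketch of the two code paths. It is not how the URI module does it: `escape_chars`, `escape_bytes` and `_escape_octets` are names I made up for illustration, and I am assuming that everything outside the RFC 3986 "unreserved" set gets percent-encoded.

```perl
use strict;
use warnings;
use Carp qw(croak);
use Encode qw(encode_utf8);

# Hypothetical helper: percent-encode every octet outside the RFC 3986
# "unreserved" set. Expects a string in which every character is <= 0xFF.
sub _escape_octets {
    my ($octets) = @_;
    my $out = '';
    # A plain split //, no \C: one element per character of the string.
    for my $c (split //, $octets) {
        $out .= $c =~ /\A[A-Za-z0-9._~-]\z/
            ? $c
            : sprintf('%%%02X', ord $c);
    }
    return $out;
}

# Hypothetical stand-in for URI->chars: the argument is text, so it is
# serialized to UTF-8 octets first and then escaped blindly.
sub escape_chars {
    my ($chars) = @_;
    return _escape_octets( encode_utf8($chars) );
}

# Hypothetical stand-in for URI->bytes: the argument is supposed to be
# octets already, so any codepoint above 0xFF is a caller error.
sub escape_bytes {
    my ($bytes) = @_;
    croak 'Codepoint > 0xFF in byte string' if $bytes =~ /[^\x00-\xFF]/;
    return _escape_octets($bytes);
}
```

Both functions look only at codepoints, so a string holding the single character 0xE9 gives the same result whether or not it has been through `utf8::upgrade`: `%C3%A9` from `escape_chars`, `%E9` from `escape_bytes`.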

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>
