I regret that I let \C sneak into the URI module. Now we have an interface that depends on the internal UTF-8 flag of the strings passed in. This makes it very hard to explain, makes it not do what you want when different types of strings are combined, and makes it hard to fix in ways that don't break some code. My plan for fixing this is to introduce URI::IRI with an interface that encodes all non-URI characters as percent-encoded UTF-8, and to live with the inconsistency for URI (until Perl redefines what \C means).
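
To illustrate what "percent-encoded UTF-8" independent of the internal flag could look like, here is a minimal sketch. The helper name escape_as_utf8 is hypothetical and is not part of URI or URI::Escape. As a simplification it escapes everything outside the RFC 3986 unreserved set, which is stricter than "non-URI characters". It encodes explicitly with Encode instead of relying on \C:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (an assumption for illustration, not the URI API).
# Always goes through an explicit UTF-8 encoding step, so the result does
# not depend on how Perl happens to store the string internally.
sub escape_as_utf8 {
    my ($text) = @_;
    my $octets = encode('UTF-8', $text);    # characters -> UTF-8 octets
    # Percent-encode every octet outside the RFC 3986 unreserved set.
    $octets =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $octets;
}

my $native   = "caf\xE9";     # stored internally as Latin-1 bytes
my $upgraded = "caf\xE9";
utf8::upgrade($upgraded);     # same characters, internal UTF-8 flag set

print escape_as_utf8($native),   "\n";    # caf%C3%A9
print escape_as_utf8($upgraded), "\n";    # caf%C3%A9
```

Both strings hold the same characters and differ only in internal representation, so an interface built this way gives the same answer for both. A \C-based split sees the internal bytes and so can give different answers for the two.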
--Gisle

On May 3, 2010, at 20:34, Michael Ludwig wrote:

> "Don't use the \C escape in regexes" - taken from Juerd's Unicode Advice page:
>
>   http://juerd.nl/site.plp/perluniadvice
>
> Why not?
>
> ------ perldoc perlre:
>   \C    Match a single C char (octet) even under Unicode.
>         NOTE: breaks up characters into their UTF-8 bytes,
>         so you may end up with malformed pieces of UTF-8.
>         Unsupported in lookbehind.
>
> ------ URI::Escape
>   sub escape_char {
>       return join '', @URI::Escape::escapes{$_[0] =~ /(\C)/g};
>   }
>
> The regular expression is used to disassemble an incoming text string into
> individual bytes (and then use the resulting list in a hash slice). It is a
> legitimate use case, and the means seems to do the job. What's the problem
> with the \C escape?
>
> --
> Michael.Ludwig (#) XING.com