Re: Perl & unicode weirdness.

Rob Park Sat, 31 Jan 2004 14:47:36 -0800

Markus Kuhn wrote:

The way in which Perl supports Unicode, you normally should hardly ever
have to call a UTF-8 encoder or decoder explicitly and manually. You
just have to make sure that when a UTF-8 string enters Perl, it does so
tagged as a UTF-8 string and not as an octet string. How that happens
depends on how the string gets into Perl. When opening files, for
instance, you can tell Perl the charset to expect or to look at the
LC_CTYPE locale.

I've been told that I should never need to call Unicode encoders or decoders manually, but it was the only way I could make the script work.

My script takes filenames as arguments (normally provided by globbing in bash), then it does some manipulations on the filename, and it pipes some output to vorbiscomment. Basically, my music collection stores all the metadata in the filesystem itself, and this script is a bit of glue for moving the metadata out of the filesystem and into the file, for music apps that expect the metadata to be in the file.

What happens is, the filenames end up having non-ASCII characters in them, as some of my music isn't English.

At first, the script was very simple. It would parse the filename, then pipe the info to vorbiscomment, which would store the info. The problem was, XMMS would display (for example) a "u with umlaut" as being two characters, "A with tilde" and "1/4", or something similar. And it would do that with ALL non-ASCII characters. Finding a solution nearly drove me nuts, as writing the data manually into vorbiscomment worked fine.
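
For what it's worth, the "A with tilde" plus "1/4" symptom is what you get when the two UTF-8 bytes of a "u with umlaut" are read as Latin-1. I can't confirm that this is what XMMS or vorbiscomment was actually doing, so treat the following as a sketch of one plausible explanation. It uses only the standard Encode module; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# "u with umlaut" is U+00FC; its UTF-8 encoding is the two bytes C3 BC.
my $char  = "\x{FC}";
my $bytes = encode('UTF-8', $char);

# Assumption: some consumer reads those two bytes as Latin-1 instead of
# UTF-8. That yields two characters: U+00C3 (A with tilde) and
# U+00BC (the 1/4 sign).
my $misread = decode('ISO-8859-1', $bytes);

printf "UTF-8 bytes: %s\n",
    join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
printf "misread as:  %s\n",
    join ' ', map { sprintf 'U+%04X', ord($_) } split //, $misread;
```

The same kind of mix-up shows up whenever already-encoded UTF-8 gets encoded a second time somewhere along the pipe.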

So then I started futzing around with 'decode_utf8', and just like magic, everything worked. Before I used that function, Perl couldn't even print the filenames properly without munging the Unicode. It's been a while, but I'm relatively certain I was using Perl 5.8.0 at the time.
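
Roughly, the decoding step looked like the sketch below. This is not my actual script: the filename, the TITLE field, and the variable names are invented for illustration, and I'm assuming the shell hands the script UTF-8 octets. The idea is to decode once on the way in, work on characters, and encode once on the way out.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Hypothetical filename as raw octets, the way it might arrive in @ARGV
# from a UTF-8 shell: "F<u-umlaut>r Elise.ogg", umlaut as bytes C3 BC.
my $octets = "F\xC3\xBCr Elise.ogg";

# Tag the data as text: after decoding, the u-umlaut is one character.
my $text = decode_utf8($octets);

printf "octets: %d, characters: %d\n", length($octets), length($text);

# Text manipulation now works per character rather than per byte.
(my $title = $text) =~ s/\.ogg\z//;

# Encode exactly once on the way out, e.g. before writing to a pipe.
my $out = encode_utf8("TITLE=$title");
print "$out\n";
```

If some other layer (an I/O layer on the filehandle, say) is already doing the decoding or encoding for you, calling these functions by hand does the job twice, which is one way to end up with garbage.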

Now, after months of having the script "just work", all the Unicode characters suddenly started turning into the '#' character for no apparent reason. I started futzing around with my script again, trying to solve this problem, and removing all calls to 'decode_utf8' fixed everything up. Now everything "Just Works" the way it ought to have back when I originally created the script. As far as I know, I haven't upgraded Perl since then.

Like I was saying, I don't have a problem: the script works! I was just wondering what the hell happened that suddenly "decode_utf8" was no longer necessary.

Question: What is a quick way in Perl to get a regular expression that
matches all Unicode characters in the range U+0100..U+10FFFF, in other
words all non-ASCII Unicode characters?

Would it be possible to do something like this:

/[\x{0100}-\x{10FFFF}]/

I dunno, it makes sense in my head, but I haven't tested it. If this doesn't work, you might want to try something with 'ord' and testing if the resultant number is within that range.
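
Here is a small sketch of both ideas side by side, the character class and the 'ord' test. The helper name is_above_ff and the sample code points are made up for illustration. Note that \x{...} takes bare hex digits, without a "0x" prefix.

```perl
use strict;
use warnings;

# Character class covering U+0100..U+10FFFF.
my $above_ff = qr/[\x{0100}-\x{10FFFF}]/;

# The same test done with ord on a single character.
sub is_above_ff {
    my ($char) = @_;
    my $cp = ord $char;
    return $cp >= 0x0100 && $cp <= 0x10FFFF;
}

# Try a few code points: ASCII "A", u-umlaut, U+0100, and the euro sign.
for my $cp (0x41, 0xFC, 0x100, 0x20AC) {
    my $char = chr $cp;
    printf "U+%04X regex=%s ord=%s\n", $cp,
        ($char =~ $above_ff   ? 'yes' : 'no'),
        (is_above_ff($char)   ? 'yes' : 'no');
}
```

One caveat: the range as stated starts at U+0100, so it skips U+0080..U+00FF (which includes the u-umlaut). If the goal really is "anything that isn't ASCII", a negated class such as /[^\x00-\x7F]/ should cover that too.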

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
