Re: Question regarding Unicode handling in perl: auto-sensing

Nick Ing-Simmons Sun, 22 Feb 2004 15:56:06 -0800

Andreas Jaekel <[EMAIL PROTECTED]> writes:
>Dear Perl Deities!
>
>I've been trying to figure this out for myself for a couple
>of hours now, but I got to the point where I gave up and decided
>that I'll have to bother you.  Hope you don't mind.
>
>My task is the following, and I'm running out of ideas:
>
>// what I want to do //
>
>I want to read in a GNU tar file from STDIN.  The tar file
>contains roughly 1.2 million files, and each of them is
>encoded in either ASCII, UTF-8 or ISO-8859-1.  The trick is,
>I don't know which file is encoded in which encoding.
>
>So, I only have one file descriptor (the tar archive), from which
>I successfully retrieve each file into a scalar, one at a time,
>and then I call my "guess_enconding()" subroutine.
>
>// what I tried //
>
>From this point on I'll describe in a few words what I already
>found out, and why it didn't help me.
>
>I found out I can set the file descriptor of the tar file to
>binmode(), or open it with <:bytes.  I do that.  But
>all it does is tell perl that the data is 8-bit raw.  That resolved
>a few confusions, but not the final problem.
>
>I found out how to detect ASCII.  I can do it with
>       eval {
>               Encode::from_to($buf, "ascii", "utf-8", Encode::FB_CROAK);
>       };
>       if($@) { ...
>
>But that leaves me with knowing UTF-8 from ISO-8859-1.


or 

       if ($buf =~ /^[\x00-\x7f]*$/)

>
>Obviously, every UTF-8 file is also a valid ISO-8859-1 file. So my
>only hope is to check for "valid UTF-8", and if that fails it has to be
>ISO-8859-1.

How about 

        Encode::from_to($buf, "utf-8", "utf-8", Encode::FB_CROAK);

But that is doing a lot of work.


>
>The "perluniintro" man page gives example code on how to do that:
>
>       use Encode 'encode_utf8';
>       if(encode_utf8($buf)) { ...

That is the wrong way round. You have raw octets and want to see 
whether they form valid characters, so you want to _decode_ them 
and see if it works.
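
A minimal sketch of that decode-and-see approach, assuming a reasonably
recent Encode. The sub name guess_encoding() and its return values are
only illustrative, not anything Encode provides. It works on a copy
because decode() with a CHECK value may modify its argument in place.

```perl
use strict;
use warnings;
use Encode;

# Hypothetical helper: classify a scalar holding raw octets.
sub guess_encoding {
    my ($octets) = @_;    # a copy; decode() with CHECK may alter it

    # Nothing outside 7 bits: plain ASCII.
    return 'ascii' if $octets !~ /[^\x00-\x7f]/;

    # Strict decode; croaks on the first malformed sequence.
    my $ok = eval { Encode::decode('UTF-8', $octets, Encode::FB_CROAK); 1 };

    # Anything that is not valid UTF-8 is assumed to be ISO-8859-1.
    return $ok ? 'utf-8' : 'iso-8859-1';
}
```

Called as guess_encoding($buf) once per file pulled out of the archive.
The ISO-8859-1 answer is a fallback, not a positive identification.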


>
>Unfortunately, this plain doesn't work.  The same man page mentions a
>second method:
>
>       use warnings;
>       @chars = unpack("U0U*", $buf);
>
>This WORKS (hurray!) but all I get is a warning, and I have not been
>able to find any way of detecting this warning inside my script.
>(short of parsing my own stderr, which would be creative, but
> I'd be shot if anyone saw my code - and rightly so)
>
>I tried all other ways I could think of using encode(), decode(),
>from_to() and unpack(). I tried Encode::FB_CROAK wherever I could.
>
>Side note:  despite the module's documentation stating otherwise,
>the function encode_utf8() would not accept a CHECK parameter.

The current docs say decode_utf8() accepts CHECK - as decode can 
get a bad sequence of octets which don't make chars.
But encode cannot fail. If I have chars I _can_ encode them as UTF-8.
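
To illustrate the asymmetry, here is a small sketch, assuming a current
Encode. The variable names are made up for the example. The same scalar
is first treated as octets to decode, then as characters to encode.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# 0xE9 starts a multi-byte sequence but is followed by a plain "d",
# so these octets are not well-formed UTF-8.
my $bad_octets = "abc\xe9def";

# decode can fail: with FB_CROAK it dies instead of substituting U+FFFD.
my $copy = $bad_octets;    # CHECK may modify the argument in place
my $decoded_ok = eval { decode_utf8($copy, Encode::FB_CROAK); 1 } ? 1 : 0;

# encode cannot fail: taken as characters, U+00E9 simply becomes the
# two octets 0xC3 0xA9.
my $encoded = encode_utf8($bad_octets);
```

Here $decoded_ok ends up false, while $encoded is always defined, which
is why a CHECK argument only makes sense on the decode side.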

>
>So, I think it boils down to: "Is this string valid UTF-8?  The
>methods given in the module documentation and man pages do not
>seem to work for me."
>
>I would appreciate any help.
>
>Thank you!
>
>Regards,
>    Andy.
