Andreas Jaekel <[EMAIL PROTECTED]> writes:
>Dear Perl Deities!
>
>I've been trying to figure this out for myself for a couple
>of hours now, but I got to the point where I gave up and decided
>that I'll have to bother you. Hope you don't mind.
>
>My task is the following, and I'm running out of ideas:
>
>// what I want to do //
>
>I want to read in a GNU tar file from STDIN. The tar file
>contains roughly 1.2 million files, and each of them is
>encoded in either ASCII, UTF-8 or ISO-8859-1. The trick is,
>I don't know which file is encoded in which encoding.
>
>So, I only have one file descriptor (the tar archive), from which
>I successfully retrieve each file into a scalar, one at a time,
>and then I call my "guess_enconding()" subroutine.
>
>// what I tried //
>
>From this point on I'll describe in a few words what I already
>found out, and why it didn't help me.
>
>I found out I can set the file descriptor of the tar file to
>binmode(), or open it with <:bytes. I do that. But
>all it does is tell perl that the data is 8-bit raw. That resolved
>a few confusions, but not the final problem.
>
>I found out how to detect ASCII. I can do it with
> eval {
> Encode::from_to($buf, "ascii", "utf-8", Encode::FB_CROAK);
> }
> if($@) { ...
>
>But that leaves me with knowing UTF-8 from ISO-8859-1.
Or, for the ASCII test, simply
 if ($buf =~ /^[\x00-\x7f]*$/)
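A minimal sketch of that ASCII test as a helper, assuming the file contents are already in a scalar of raw octets. The name is_ascii() is made up for illustration; it is not part of Encode or any other module.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $octets contains only 7-bit bytes.
sub is_ascii {
    my ($octets) = @_;
    # One byte outside 0x00-0x7F is enough to rule out ASCII.
    return $octets !~ /[^\x00-\x7F]/;
}

print is_ascii("plain text\n") ? "ascii\n" : "not ascii\n";
print is_ascii("caf\xE9")      ? "ascii\n" : "not ascii\n";
```

Searching for a single offending byte lets the regex engine stop at the first high-bit octet instead of scanning the whole buffer.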
>
>Obviously, every UTF-8 file is also a valid ISO-8859-1 file. So my
>only hope is to check for "valid UTF-8", and if that fails it has to be
>ISO-8859-1.
How about
 Encode::from_to($buf, "utf-8", "utf-8", Encode::FB_CROAK);
But that does a lot of work: it decodes the whole buffer and then re-encodes it.
>
>The "perluniintro" man page gives example code on how to do that:
>
> use Encode 'encode_utf8';
> if(encode_utf8($buf)) { ...
That is the wrong way round. You have raw octets and want to see
whether they form valid characters.
So you want to _decode_ them and see if it works.
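Something along these lines should do for "decode and see if it works". It is a sketch, not the one true way: is_valid_utf8() is a made-up name, and it assumes the strict "UTF-8" encoding of Encode is what you want to test against.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if $octets is well-formed UTF-8.
sub is_valid_utf8 {
    my ($octets) = @_;
    # decode() may modify its argument when CHECK is set, so use a copy.
    my $copy = $octets;
    # FB_CROAK makes decode() die on the first malformed sequence;
    # eval turns that into a false return value.
    return eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

print is_valid_utf8("caf\xC3\xA9") ? "valid\n" : "invalid\n";
print is_valid_utf8("caf\xE9")     ? "valid\n" : "invalid\n";
```

The decoded characters are thrown away here; only success or failure matters.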
>
>Unfortunately, this plain doesn't work. The same man page mentions a
>second method:
>
> use warnings;
> @chars = unpack("U0U*", $buf);
>
>This WORKS (hurray!) but all I get is a warning, and I have not been
>able to find any way of detecting this warning inside my script.
>(short of parsing my own stderr, which would be creative, but
> I'd be shot if anyone saw my code - and rightly so)
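For what it is worth, warnings can be caught inside the script without touching stderr by installing a local $SIG{__WARN__} handler. A rough sketch, with a made-up helper name warnings_from() and a simulated warning standing in for the unpack() call, since whether unpack() warns on a given input depends on the perl version:

```perl
use strict;
use warnings;

# Hypothetical helper: run a code block and return the warnings it emitted.
sub warnings_from {
    my ($code) = @_;
    my @caught;
    # The handler is only active until this sub returns.
    local $SIG{__WARN__} = sub { push @caught, @_ };
    $code->();
    return @caught;
}

# Simulated warning; in real use the block would contain the unpack() call.
my @w = warnings_from(sub { warn "simulated malformed-UTF-8 warning\n" });
print scalar(@w), " warning(s) caught\n";
```

An alternative that may suit better is "use warnings FATAL => 'utf8';", which promotes warnings in that category to exceptions that eval can trap.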
>
>I tried all other ways I could think of using encode(), decode(),
>from_to() and unpack(). I tried Encode::FB_CROAK wherever I could.
>
>Side note: despite the module's documentation stating otherwise,
>the function encode_utf8() would not accept a CHECK parameter.
The current docs say decode_utf8() accepts CHECK - as decode can
get a bad sequence of octets which does not make chars.
But encode cannot fail. If I have chars I _can_ always encode them as UTF-8.
>
>So, I think it boils down to: "Is this string valid UTF-8? The
>methods given in the module documentations and man pages do not
>seem to work for me."
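Putting the pieces together, the three-way guess could look roughly like this. It is only a sketch: guess_encoding() is a hypothetical stand-in for the subroutine described in the question, and it swaps in the core utf8::decode() function, which validates in place and returns false on malformed input, instead of Encode. Note that utf8::decode() follows perl's lax notion of UTF-8, so it may accept sequences a strict checker would reject.

```perl
use strict;
use warnings;

# Hypothetical classifier: returns 'ascii', 'utf-8' or 'iso-8859-1'.
sub guess_encoding {
    my ($octets) = @_;

    # No high-bit bytes at all: plain ASCII.
    return 'ascii' unless $octets =~ /[^\x00-\x7F]/;

    # utf8::decode() works in place, so test a copy.
    my $copy = $octets;
    return 'utf-8' if utf8::decode($copy);

    # Anything else is assumed to be ISO-8859-1, as the question states.
    return 'iso-8859-1';
}

print guess_encoding("hello"),        "\n";
print guess_encoding("caf\xC3\xA9"),  "\n";
print guess_encoding("caf\xE9"),      "\n";
```

The fallback to ISO-8859-1 is a guess by elimination: any byte string is valid ISO-8859-1, so a Latin-1 file that happens to form valid UTF-8 sequences will be reported as UTF-8.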
>
>I would appreciate any help.
>
>Thank you!
>
>Regards,
> Andy.