Andreas Jaekel <[EMAIL PROTECTED]> writes:
>Dear Perl Deities!
>
>I've been trying to figure this out for myself for a couple
>of hours now, but I got to the point where I gave up and decided
>that I'll have to bother you. Hope you don't mind.
>
>My task is the following, and I'm running out of ideas:
>
>// what I want to do //
>
>I want to read in a GNU tar file from STDIN. The tar file
>contains roughly 1.2 million files, and each of them is
>encoded in either ASCII, UTF-8 or ISO-8859-1. The trick is,
>I don't know which file is encoded in which encoding.
>
>So, I only have one file descriptor (the tar archive), from which
>I successfully retrieve each file into a scalar, one at a time,
>and then I call my "guess_encoding()" subroutine.
>
>// what I tried //
>
>From this point on I'll describe in a few words what I already
>found out, and why it didn't help me.
>
>I found out I can set the file descriptor of the tar file to
>binmode(), or open it with <:bytes. I do that. But
>all it does is tell perl that the data is 8-bit raw. That resolved
>a few confusions, but not the final problem.
>
>I found out how to detect ASCII. I can do it with
>   eval {
>       Encode::from_to($buf, "ascii", "utf-8", Encode::FB_CROAK);
>   };
>   if($@) { ...
>
>But that leaves me with knowing UTF-8 from ISO-8859-1.
or
  if ($buf =~ /^[\x00-\x7f]*$/)

>
>Obviously, every UTF-8 file is also a valid ISO-8859-1 file. So my
>only hope is to check for "valid UTF-8", and if that fails it has to be
>ISO-8859-1.

How about
  Encode::from_to($buf, "utf-8", "utf-8", Encode::FB_CROAK);
But that is doing a lot of work.

>
>The "perluniintro" man page gives example code on how to do that:
>
>   use Encode 'encode_utf8';
>   if(encode_utf8($buf)) { ...

That is the wrong way round. You have raw octets and you want to see
whether they are characters. So you want to _decode_ them and see if it works.

>
>Unfortunately, this plain doesn't work. The same man page mentions a
>second method:
>
>   use warnings;
>   @chars = unpack("U0U*", $buf);
>
>This WORKS (hurray!) but all I get is a warning, and I have not been
>able to find any way of detecting this warning inside my script.
>(short of parsing my own stderr, which would be creative, but
> I'd be shot if anyone saw my code - and rightly so)
>
>I tried all other ways I could think of using encode(), decode(),
>from_to() and unpack(). I tried Encode::FB_CROAK wherever I could.
>
>Side note: despite the module's documentation stating otherwise,
>the function encode_utf8() would not accept a CHECK parameter.

The current docs say decode_utf8() accepts CHECK - as decode can get a bad
sequence of octets which doesn't make chars. But encode cannot fail. If I
have chars I _can_ encode them as UTF-8.

>
>So, I think it boils down to: "Is this string valid UTF-8? The
>methods given in the module documentation and man pages do not
>seem to work for me."
>
>I would appreciate any help.
>
>Thank you!
>
>Regards,
>  Andy.
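
Putting the pieces together, the "decode and see if it works" idea might look something like the sketch below. The sub name guess_encoding comes from the description in the question, but the body is only my assumption of how it could be written. It also assumes the three encodings listed are the only candidates, so anything that is neither pure ASCII nor well-formed UTF-8 is reported as ISO-8859-1.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: classify a scalar of raw octets as
# 'ascii', 'utf-8' or 'iso-8859-1' (the only candidates assumed here).
sub guess_encoding {
    my ($octets) = @_;

    # Nothing outside 0x00-0x7F means plain ASCII.
    return 'ascii' if $octets !~ /[^\x00-\x7f]/;

    # decode() may modify its source argument when CHECK is set,
    # so hand it a copy rather than the caller's data.
    my $copy = $octets;

    # FB_CROAK makes decode() die on malformed input; eval traps that.
    my $ok = eval {
        Encode::decode('UTF-8', $copy, Encode::FB_CROAK());
        1;
    };

    # Not well-formed UTF-8: fall back to ISO-8859-1 by elimination.
    return $ok ? 'utf-8' : 'iso-8859-1';
}
```

It would be called once per file, e.g. my $enc = guess_encoding($buf); with $buf read from the binmode()d handle. Bear in mind this is a heuristic: an ISO-8859-1 file whose high bytes happen to form well-formed UTF-8 sequences will be reported as UTF-8.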