Dear Perl Deities! I've been trying to figure this out for myself for a couple of hours now, but I got to the point where I gave up and decided that I'll have to bother you. Hope you don't mind.
My task is the following, and I'm running out of ideas:

// what I want to do //

I want to read in a GNU tar file from STDIN. The tar file contains roughly 1.2 million files, and each of them is encoded in either ASCII, UTF-8 or ISO-8859-1. The trick is, I don't know which file is encoded in which encoding. So, I only have one file descriptor (the tar archive), from which I successfully retrieve each file into a scalar, one at a time, and then I call my "guess_enconding()" subroutine.

// what I tried //

From this point on I'll describe in a few words what I already found out, and why it didn't help me.

I found out I can set the file descriptor of the tar file to binmode(), or open it with <:bytes. I do that. But all it does is tell perl that the data is 8-bit raw. That resolved a few confusions, but not the final problem.

I found out how to detect ASCII. I can do it with

    eval { Encode::from_to($buf, "ascii", "utf-8", Encode::FB_CROAK); };
    if($@) { ...

But that leaves me with telling UTF-8 apart from ISO-8859-1. Obviously, every UTF-8 file is also a valid ISO-8859-1 file. So my only hope is to check for "valid UTF-8", and if that fails it has to be ISO-8859-1.

The "perluniintro" man page gives example code on how to do that:

    use Encode 'encode_utf8';
    if(encode_utf8($buf)) { ...

Unfortunately, this plainly doesn't work. The same man page mentions a second method:

    use warnings;
    @chars = unpack("U0U*", $buf);

This WORKS (hurray!) but all I get is a warning, and I have not been able to find any way of detecting this warning inside my script. (Short of parsing my own stderr, which would be creative, but I'd be shot if anyone saw my code - and rightly so.)

I tried all other ways I could think of using encode(), decode(), from_to() and unpack(). I tried Encode::FB_CROAK wherever I could. Side note: despite the module's documentation stating otherwise, the function encode_utf8() would not accept a CHECK parameter.

So, I think it boils down to: "Is this string valid UTF-8? The methods given in the module documentation and man pages do not seem to work for me."

I would appreciate any help. Thank you!

Regards,
Andy.
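
A minimal sketch of the three-way check described above, for illustration only. It assumes a reasonably recent Encode in which decode() with the strict "UTF-8" encoding name and Encode::FB_CROAK dies on malformed input; the post reports that FB_CROAK did not help on the installation in question, so this may not apply there. The name guess_encoding() and its return values are hypothetical.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: classify a raw byte string as 'ascii', 'utf-8',
# or 'iso-8859-1'. Assumes those are the only three possibilities.
sub guess_encoding {
    my ($bytes) = @_;

    # No byte above 0x7F means plain ASCII (which is also valid UTF-8).
    return 'ascii' unless $bytes =~ /[^\x00-\x7F]/;

    # Strict decode: "UTF-8" (with the hyphen) is the strict variant, and
    # FB_CROAK turns a malformed sequence into an exception.
    # Work on a copy, because decode() with a CHECK value may modify
    # its source argument.
    my $copy = $bytes;
    my $ok = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK); 1 };

    # Anything that is not valid UTF-8 is taken to be ISO-8859-1.
    return $ok ? 'utf-8' : 'iso-8859-1';
}
```

One caveat with this approach: an ISO-8859-1 file whose high bytes happen to form well-formed UTF-8 sequences would be reported as UTF-8, so the result is a guess rather than a proof.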