You could use the UTF byte order mark (BOM) to determine whether a file is UTF-encoded and which form of UTF (UTF-8, UTF-16 or UTF-32, big-endian or little-endian) is in use. The BOM is the Unicode-defined character U+FEFF, usually inserted transparently at the beginning of a UTF file. Granted, this is not a perfect answer, but it may help for want of any other way to determine whether a file is UTF or not.

However, a BOM is not always present: some platforms always write one (Microsoft) and some platforms eschew them. Windows Notepad is particularly tricky because it adds one without you realizing it. So whether or not a file has been through Notepad (or another simple editor) can affect your results, and can make you question your sanity if you didn't realize this.

A BOM can be very useful. For example, an XML header declares the character encoding, but a BOM can be used to determine the character encoding of the XML header itself. For UTF-8, the BOM only tells you whether the file is UTF-encoded or not. For UTF-16 and UTF-32, it also lets you determine the endianness of the UTF code units.

Harry
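
As a rough illustration of the BOM check described above, here is a minimal Python sketch. The function name `sniff_bom` and the table `_BOMS` are hypothetical names chosen for this example; the BOM byte constants come from Python's standard `codecs` module. It only inspects the leading bytes, so a file without a BOM simply yields no answer, and a UTF-16 little-endian file whose first character is U+0000 would be misread as UTF-32.

```python
import codecs

# BOM signatures, checked longest first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE), so UTF-32 must be tested first.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length) if data starts with a known BOM,
    otherwise (None, 0). Absence of a BOM proves nothing about the file."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

In use, one would read the first four bytes of the file in binary mode, pass them to `sniff_bom`, and fall back to some other guess (or to asking the user) when it returns `None`.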
> Date: Fri, 10 Jan 2014 08:01:42 +0000
> From: peter.hunke...@credit-suisse.com
> Subject: Re: Subject Unicode
> To: IBM-MAIN@LISTSERV.UA.EDU
>
> >Other than with a lot of inferential cleverness, there is no way to look at
> >an "ASCII-like" file and tell what the code page is.
>
> The same applies to data encoded in EBCDIC. In fact, files are nothing but a
> series of bytes. You always need to know what those bytes represent in order
> to be able to work on them in a meaningful way.
>
> Especially in the distributed world, some conventions have been established
> that help programs guess what the file content might be. The first
> couple of bytes contain a certain byte sequence to identify the type of the
> file. But still, there is no guarantee the rest of the file matches that
> indication. Unfortunately, no such convention exists for pure text data:
> neither a convention to indicate that this is text nor one to tell the
> encoding / code page used.
>
> --
> Peter Hunkeler
>
> ----------------------------------------------------------------------
> For IBM-MAIN subscribe / signoff / archive access instructions,
> send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN