Re: Subject Unicode

Harry Wahl Fri, 10 Jan 2014 06:37:55 -0800

You could use the BOM (byte order mark) to determine whether a file is UTF or 
not, and which form of UTF (UTF-8, UTF-16, UTF-32, big-endian or little-endian) 
is being used. 
The BOM is the Unicode character U+FEFF, usually inserted transparently at the 
beginning of a UTF file. Granted, this is not a perfect answer, but it may help 
for want of any other way to determine whether a file is UTF or not.
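
A minimal sketch of such a check in Python, assuming the first few bytes of 
the file have already been read into memory. The function name sniff_bom and 
the table _BOMS are made up for illustration; the BOM constants come from the 
standard codecs module. A missing BOM proves nothing either way.

```python
import codecs

# Longer signatures are listed first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE), so the order of the checks matters.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

Note that a UTF-16 LE file whose first character is U+0000 would be mistaken 
for UTF-32 LE by this check; that is rare in text files but possible.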
However, a BOM is not always present: some platforms always write one 
(Microsoft) and some platforms eschew it. Windows Notepad is particularly 
tricky because it adds one without you realizing it. So whether or not you have 
looked at a file with Notepad (or another simple editor) can affect your 
results and make you question your sanity, because you didn't realize this.
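
To illustrate how easily this goes unnoticed, here is a small Python sketch. 
It does not involve Notepad; it uses Python's "utf-8-sig" codec as a stand-in 
for any tool that writes a UTF-8 BOM. The variable names and the helper 
has_utf8_bom are made up for illustration.

```python
import codecs

text = "hello"
with_bom = text.encode("utf-8-sig")   # this codec prepends the bytes EF BB BF
without_bom = text.encode("utf-8")    # plain UTF-8, no BOM


def has_utf8_bom(data: bytes) -> bool:
    """True if the data starts with the three-byte UTF-8 BOM."""
    return data.startswith(codecs.BOM_UTF8)


# Decoding with plain "utf-8" keeps the BOM as an invisible U+FEFF character
# at the start of the string; "utf-8-sig" strips it.
decoded_plain = with_bom.decode("utf-8")
decoded_sig = with_bom.decode("utf-8-sig")
```

Both versions look the same in most editors, yet the files differ by three 
bytes, and a byte-for-byte comparison or a parser that does not expect the BOM 
may trip over it.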
The BOM can be very useful. For example, an XML header defines the character 
encoding, but the BOM can be used to determine the character encoding of the 
XML header itself.
For UTF-8 the BOM can only be used to determine whether a file is UTF encoded 
or not. But for UTF-16 and UTF-32, it also allows you to determine the 
endianness of the UTF code units.
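
A short Python sketch of the byte-order point for UTF-16, using the letter "A" 
(U+0041) as sample data. The helper name utf16_byte_order is made up for 
illustration, and it assumes the data is already known to be UTF-16, since the 
UTF-32 LE BOM starts with the same two bytes as the UTF-16 LE one.

```python
import codecs


def utf16_byte_order(data: bytes):
    """Return 'big', 'little', or None when no UTF-16 BOM is present.

    Assumes the data is UTF-16; UTF-32 LE data would also report 'little'.
    """
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "little"
    return None


# The same character, serialized in both byte orders with a leading BOM.
be = codecs.BOM_UTF16_BE + "A".encode("utf-16-be")   # FE FF 00 41
le = codecs.BOM_UTF16_LE + "A".encode("utf-16-le")   # FF FE 41 00
```

UTF-8 has no such ambiguity because its code units are single bytes, which is 
why its BOM serves only as a signature.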
Harry


> Date: Fri, 10 Jan 2014 08:01:42 +0000
> From: peter.hunke...@credit-suisse.com
> Subject: Re: Subject Unicode
> To: IBM-MAIN@LISTSERV.UA.EDU
> 
> >Other than with a lot of inferential cleverness, there is no way to look at 
> >an "ASCII-like" file and tell what the code page is. 
> 
> The same applies to data encoded in EBCDIC. In fact, files are nothing but a 
> series of bytes. You always need to know what those bytes represent in order 
> to be able to work on them in a meaningful way. 
> 
> Especially in the distributed world, some conventions have been established 
> that help programs in guessing what the file content might be. The first 
> couple of bytes contain a certain byte sequence to identify the type of the 
> file. But still, there is no guarantee the rest of the file matches that 
> indication. Unfortunately, no such convention exists for pure text data. 
> Neither a convention to indicate this is text nor to tell the encoding / code 
> page used.
> 
> --
> Peter Hunkeler
> 
> ----------------------------------------------------------------------
> For IBM-MAIN subscribe / signoff / archive access instructions,
> send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
                                          
----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN