On Tue, 2004-05-25 at 17:43, Ehsan Akhgari wrote:
> Well, maybe you're right, but I don't see how a text editor is supposed to
> know the encoding of a file without some kind of mark. 

Did Latin-1 (an old encoding of text files for Western Europe, also
called ISO 8859-1) have a mark to distinguish it from, say, CP1256 (an
old MS encoding for the Arabic script)? Did ASCII have a mark? No. Text
files are text files. They are not supposed to have marks to distinguish
their character set.

The character set of a text file should be in the metadata (file name,
file system, environment variable, HTTP header, MIME header, ...), or it
should be auto-detected (UTF-8 is really easy to detect, since it has a
very regular mathematical pattern; UTF-16 is also easy to detect, since
it's recommended that it have a BOM), or it should be specified by the
user when opening the file.
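
To illustrate the auto-detection point, here is a minimal Python sketch.
The function name looks_like_utf8 is made up for this example, and it
simply relies on Python's strict UTF-8 decoder to check the byte pattern;
a real editor would probably use a more elaborate heuristic.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF-8 sequence.

    Hypothetical helper: it leans on Python's strict decoder, which
    rejects any lead byte not followed by the right continuation bytes.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same Persian word, encoded two ways.
word = "\u0633\u0644\u0627\u0645"          # "salam" in Arabic script
print(looks_like_utf8(word.encode("utf-8")))   # valid UTF-8
print(looks_like_utf8(word.encode("cp1256")))  # bytes D3 E1 C7 E3: not UTF-8
```

Note that pure ASCII text also passes this check, since ASCII is a subset
of UTF-8, so the test is only informative when non-ASCII bytes are present.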

> Plain text files have no means of
> identifying the character encoding,

That is somewhat true. Plain text files *sometimes* have no means of
identifying the character encoding *by themselves*.

> so a single text file can be interpreted
> as UTF-7, UTF-8, UTF-16, UTF-32, etc. if there's nothing to declare the
> exact character encoding used.

UTF-7 is deprecated. UTF-16 and UTF-32 *do* have BOM marks in the
standards defining them, so it's OK if they use a BOM. UTF-8 doesn't
have that. Nor does ASCII, CP1256, Latin-1, etc.
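
As a rough sketch of how a program could use those BOMs, the Python
snippet below checks the start of a byte string against the BOM constants
from the standard codecs module. The function name sniff_bom is only an
illustration, and the UTF-32 checks come first because the UTF-32
little-endian BOM begins with the same two bytes as the UTF-16 one.

```python
import codecs

# Longest BOMs first: BOM_UTF32_LE (FF FE 00 00) starts with BOM_UTF16_LE (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a UTF-16/UTF-32 BOM, or None.

    Hypothetical helper for illustration only.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


sample = codecs.BOM_UTF16_LE + "text".encode("utf-16-le")
print(sniff_bom(sample))        # BOM found
print(sniff_bom(b"no mark"))    # no BOM, so None
```

A UTF-16 little-endian file whose first character after the BOM is U+0000
would be misread as UTF-32 by this check; that case is rare in text files
but it shows that even BOM sniffing is a heuristic.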

> The point here is that, protocols which do not allow BOM are those who
> provide other means of specifying the character encoding.

The point is that Notepad doesn't add a mark to Latin-1 or CP1256, so why
should it add one to UTF-8?

> A certain byte
> stream can have multiple interpretations depending on what content encoding
> you use to interpret it, and there must be some way to cut off this
> confusion.

Yes, by metadata, auto-detection, or explicit selection by the user.
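
For a concrete picture of the ambiguity being discussed, this small Python
sketch decodes the same four bytes under two legacy encodings. The helper
name decode_as and the sample bytes are chosen just for this example.

```python
def decode_as(data: bytes, encoding: str) -> str:
    """Decode bytes under a caller-chosen encoding (hypothetical helper)."""
    return data.decode(encoding)


raw = b"\xd3\xe1\xc7\xe3"

# Under CP1256 these bytes are the Persian/Arabic word "salam";
# under Latin-1 the very same bytes are four accented Latin letters.
print(decode_as(raw, "cp1256"))
print(decode_as(raw, "latin-1"))
```

Neither reading is wrong as far as the bytes are concerned, which is why
the choice has to come from outside the byte stream.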

roozbeh


_______________________________________________
PersianComputing mailing list
[EMAIL PROTECTED]
http://lists.sharif.edu/mailman/listinfo/persiancomputing
