On Tue, 2004-05-25 at 17:43, Ehsan Akhgari wrote:

> Well, maybe you're right, but I don't see how a text editor is supposed to
> know the encoding of a file without some kind of mark.
Did Latin-1 (an old encoding of text files for Western Europe, also called
ISO 8859-1) have a mark to distinguish it from, say, CP1256 (an old Microsoft
encoding for the Arabic language)? Did ASCII have a mark? No. Text files are
text files; they are not supposed to carry marks identifying their character
set. The character set of a text file should come from metadata (file name,
file system, environment variable, HTTP header, MIME header, ...), or it
should be auto-detected (UTF-8 is really easy to detect, since it has a very
regular mathematical pattern; UTF-16 is also easy to detect, since the
standard recommends it carry a BOM), or it should be specified by the user
when opening the file.

> Plain text files have no means of
> identifying the character encoding,

That is only partly true: plain text files *sometimes* have no means of
identifying the character encoding *by themselves*.

> so a single text file can be interpreted
> as UTF-7, UTF-8, UTF-16, UTF-32, etc. if there's nothing to declare the
> exact character encoding used.

UTF-7 is deprecated. UTF-16 and UTF-32 *do* have BOM marks in the standards
defining them, so it's fine if they use a BOM. UTF-8 has no such provision,
and neither do ASCII, CP1256, Latin-1, etc.

> The point here is that, protocols which do not allow BOM are those who
> provide other means of specifying the character encoding.

The point is that Notepad doesn't add a mark to Latin-1 or CP1256 files, so
why should it add one to UTF-8?!

> A certain byte
> stream can have multiple interpretations depending on what content encoding
> you use to interpret it, and there must be some way to cut off this
> confusion.

Yes: by metadata, auto-detection, or explicit selection by the user.

roozbeh

_______________________________________________
PersianComputing mailing list
[EMAIL PROTECTED]
http://lists.sharif.edu/mailman/listinfo/persiancomputing
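P.S. To illustrate the auto-detection point above: UTF-8's lead bytes encode
how many continuation bytes (of the form 10xxxxxx) must follow, so random
Latin-1 or CP1256 high bytes almost never form valid sequences. A minimal
sketch in Python; `looks_like_utf8` is a name chosen here for illustration,
and the validator is simplified (it does not reject every overlong or
surrogate 3-/4-byte form):

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if `data` follows UTF-8's lead/continuation pattern.

    Simplified sketch: checks the lead-byte ranges and that each
    required continuation byte matches 10xxxxxx; a full validator
    would also reject overlong and surrogate encodings.
    """
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                # 0xxxxxxx: plain ASCII byte
            i += 1
            continue
        if 0xC2 <= b <= 0xDF:       # lead of a 2-byte sequence
            need = 1
        elif 0xE0 <= b <= 0xEF:     # lead of a 3-byte sequence
            need = 2
        elif 0xF0 <= b <= 0xF4:     # lead of a 4-byte sequence
            need = 3
        else:                       # stray continuation or invalid lead
            return False
        if i + need >= n:           # sequence truncated at end of data
            return False
        for j in range(1, need + 1):
            if data[i + j] & 0xC0 != 0x80:   # not 10xxxxxx
                return False
        i += need + 1
    return True
```

Text in legacy 8-bit encodings fails this check as soon as a high byte is not
followed by the continuation bytes the pattern demands, which is why an editor
can pick out UTF-8 reliably without any mark.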