On Fri, 07 Jun 2013 04:53:42 -0700, Νικόλαος Κούρας wrote:

> Do you mean that utf-8, latin-iso, greek-iso and ASCII have the 1st
> 0-127 codepoints similar?
You can answer this yourself. Open a terminal window and start a Python
interactive session. Then try it and see what happens:

s = ''.join(chr(i) for i in range(128))
bytes_as_utf8 = s.encode('utf-8')
bytes_as_latin1 = s.encode('latin-1')
bytes_as_greek_iso = s.encode('ISO-8859-7')
bytes_as_ascii = s.encode('ascii')
bytes_as_utf8 == bytes_as_latin1 == bytes_as_greek_iso == bytes_as_ascii

What result do you get? True or False? And now you know the answer,
without having to ask.

> For example char 'a' has the value of '65' for all of those character
> sets? Is that what you mean?

You can answer that question yourself:

c = 'a'
for encoding in ('utf-8', 'latin-1', 'ISO-8859-7', 'ascii'):
    print(c.encode(encoding))

By the way, I believe that Python has made a strategic mistake in the
way that bytes are printed. I think it leads to more confusion, not
less. Better would be something like this:

c = 'a'
for encoding in ('utf-8', 'latin-1', 'ISO-8859-7', 'ascii'):
    print(hex(c.encode(encoding)[0]))

For historical reasons, most (but not all) charsets are supersets of
ASCII. That is, the first 128 characters in the charset are the same as
the 128 characters of ASCII.

> s = 'a' (This is unicode right? Why when we assign a string to a
> variable that string's type is always unicode

Strings in Python 3 are Unicode strings. That's just the way Python
works. Unicode was chosen because Unicode includes over a million
different characters (well, potentially over a million; most of them
are currently unused), and is a strict superset of *all* common legacy
codepages from the old DOS and Windows 95 days.

> and does not automatically
> become utf-8 which includes all available world-wide characters?
> Unicode is something different than a character set? )

Unicode is a character set. It is an enormous set of over one million
characters (technically "code points", but don't worry about the
difference right now) which can be collected in strings.
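The experiment described above can also be run in one go. A minimal sketch, using the same four charsets as above:

```python
# Encode the first 128 codepoints with each charset. Because all four
# are supersets of ASCII, the resulting byte sequences are identical.
s = ''.join(chr(i) for i in range(128))
encoded = [s.encode(enc) for enc in ('utf-8', 'latin-1', 'ISO-8859-7', 'ascii')]
print(all(b == encoded[0] for b in encoded))  # True
```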
UTF-8 is an encoding that goes from a string using the Unicode
character set into bytes, and back again. Sometimes people are lazy and
say "UTF-8" when they mean "Unicode", or vice versa. UTF-16 and UTF-32
are two different encodings for the same purpose, but for various
technical reasons UTF-8 is better for files.

'λ' is a character which exists in some charsets but not others. It is
not in the ASCII charset, nor is it in Latin-1, nor Big-5. It is in the
ISO-8859-7 charset, and of course it is in Unicode.

In ISO-8859-7, the character 'λ' is stored as byte 0xEB (decimal 235),
just as the character 'a' is stored as byte 0x61 (decimal 97).

In UTF-8, the character λ is stored as two bytes 0xCE 0xBB.

In UTF-16 (big-endian), the character λ is stored as two bytes
0x03 0xBB.

In UTF-32 (big-endian), the character λ is stored as four bytes
0x00 0x00 0x03 0xBB.

That's four different ways of "spelling" the same character as bytes,
just as "three", "trois", "drei", "τρία", "três" are all different ways
of spelling the same number 3.

> utf8_byte = s.encode('utf-8')
>
> Now if we are to decode this back to utf8 we will receive the char
> 'a'. I believe same thing will happen with latin, greek, ascii isos.
> Correct?

Why don't you try it for yourself and see?

> The characters that will not decode correctly are those that their
> codepoints are greater than 127 ?

Maybe, maybe not. It depends on which codepoint, and which encodings.
Some encodings use the same bytes for the same characters. Some
encodings use different bytes. It all depends on the encoding, just as
American and British English both spell 3 as "three", while French
spells it "trois".
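Trying it for yourself might look something like this (a sketch, using the same charsets discussed above):

```python
# Round-trip: encode to bytes, then decode with the SAME charset,
# and you always get the original character back.
for encoding in ('utf-8', 'latin-1', 'ISO-8859-7', 'ascii'):
    print('a'.encode(encoding).decode(encoding))  # prints a each time

# The four different byte "spellings" of the one character λ:
for encoding in ('ISO-8859-7', 'utf-8', 'utf-16be', 'utf-32be'):
    print(encoding, '\u03bb'.encode(encoding))
```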
> for example if s = 'α' (greek character equivalent to english 'a')

In Latin-1, 'α' does not exist:

py> 'α'.encode('latin-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u03b1' in
position 0: ordinal not in range(256)

In the old Greek charset ISO-8859-7, 'α' is stored as byte 0xE1:

py> 'α'.encode('ISO-8859-7')
b'\xe1'

But in the old *Russian* charset ISO-8859-5, the byte 0xE1 means a
completely different character, CYRILLIC SMALL LETTER ES:

py> b'\xE1'.decode('ISO-8859-5')
'с'

(Don't be fooled that this looks like the English c; it is not the same
character.)

In Unicode, 'α' is always codepoint 0x3B1 (decimal 945):

py> ord('α')
945

but before you can store that on a disk, or as a file name, it needs to
be converted to bytes, and which bytes you get depends on which
encoding you use:

py> 'α'.encode('utf-8')
b'\xce\xb1'
py> 'α'.encode('utf-16be')
b'\x03\xb1'
py> 'α'.encode('utf-32be')
b'\x00\x00\x03\xb1'

-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list