On 07Jun2013 09:56, Νίκος Γκρ33κ <nikos.gr...@gmail.com> wrote:
| On 7/6/2013 4:01 am, Cameron Simpson wrote:
| >On 06Jun2013 11:46, Νίκος Γκρ33κ <nikos.gr...@gmail.com> wrote:
| >| On Thursday, 6 June 2013 3:44:52 PM UTC+3, Steven D'Aprano wrote:
| >| > py> s = '999-Eυχή-του-Ιησού'
| >| > py> bytes_as_utf8 = s.encode('utf-8')
| >| > py> t = bytes_as_utf8.decode('iso-8859-7', errors='replace')
| >| > py> print(t)
| >| > 999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ
| >|
| >| Does errors='replace' mean don't break in case of error?
| >
| >Yes. The result will be correct for correct iso-8859-7 and slightly mangled
| >for something that would not decode smoothly.
|
| How can it be correct? We have encoded our string in utf-8 and then
| we tried to decode it as greek-iso? How can this possibly be
| correct?

Ok, not correct. But consistent. Safe to call.

iso-8859-7 is an 8-bit mapping from byte values to codepoints, much
like iso-8859-1, but it is not quite complete: a few byte values
(0xAE, 0xD2 and 0xFF) have no character assigned. Any byte sequence
that avoids those values decodes without error. It may decode to the
"wrong" characters, but the reverse process (characters encoded as
bytes) will produce the original bytes. The unassigned bytes are
where errors='replace' matters: each one becomes U+FFFD. That is the
� in Steven's output (the utf-8 encoding of 'ή' is the bytes CE AE),
and that one substitution cannot be reversed.

So a string whose bytes avoid those values round-trips safely; this
particular one does not.

| >| You took the unicode 's' string you utf-8 bytestringed it.
| >| Then how its possible to ask for the utf8-bytestring to decode
| >| back to unicode string with the use of a different charset than the
| >| one used for encoding and this actually printed the filename in
| >| greek-iso?
| >
| >It is easily possible, as shown above. Does it make sense? Normally
| >not, but Steven is demonstrating how your "mv" exercises have
| >behaved: a rename using utf-8, then a _display_ using iso-8859-7.
|
| Same as above, i don't understand it at all, since different
| charsets(encodings) used in the encode/decode process.

Visually, the names will be garbage. And if you go:

  mv '999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ.mp3' '999-Eυχή-του-Ιησού.mp3'

while using the iso-8859-7 locale, the wrong thing will occur
(assuming it even works: the � marks a byte with no iso-8859-7
character, so you may not be able to type the old name exactly).

Why? In the iso-8859-7 locale, your file (currently named under a
utf-8 regime) looks like '999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ.mp3', because the
utf-8 byte sequence maps to those characters in iso-8859-7. When you
issue the above "mv" command, the new name will be the _iso-8859-7_
bytes encoding for '999-Eυχή-του-Ιησού.mp3'. Which, under a utf-8
regime, will decode to _other_ characters, or not decode at all.
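
To make the two views of the same name concrete, here is a small
sketch. It is mine, not from Steven's post, and the variable names are
only for illustration. It assumes Python 3 and reuses the example
filename from above; nothing touches the disc.

```python
name = '999-Eυχή-του-Ιησού.mp3'

# The bytes stored on disc when the file was named under a utf-8 regime.
utf8_bytes = name.encode('utf-8')

# What an iso-8859-7 session shows for those same bytes.
shown = utf8_bytes.decode('iso-8859-7', errors='replace')

# The bytes "mv" would write for the new name from an iso-8859-7 session.
greek_bytes = name.encode('iso-8859-7')

# How a utf-8 session then reads that new name.
back_in_utf8 = greek_bytes.decode('utf-8', errors='replace')

print(repr(shown))
print(greek_bytes)
print(repr(back_in_utf8))
```

The point is only that the same characters give different bytes under
the two encodings, and the same bytes give different characters under
the two decodings.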

If you want to repair filenames, by which I mean cause them to be
correctly encoded for utf-8, you are best to work in utf-8 (using
"mv" or python). Of course, the badly named files will then look
wrong in your listing.

If you _know_ the filenames were written using iso-8859-7 encoding,
and that the names are "right" under that encoding, you can write
python code to rename them to utf-8.

Totally untested example code:

  import os
  import sys
  from binascii import hexlify

  # Use a bytes path so the names come back as raw bytes.
  for bytename in os.listdir(b'.'):
      # Read the stored bytes as iso-8859-7, then re-encode as utf-8.
      unicode_name = bytename.decode('iso-8859-7')
      new_bytename = unicode_name.encode('utf-8')
      print("%s: %s => %s"
            % (unicode_name, hexlify(bytename), hexlify(new_bytename)),
            file=sys.stderr)
      os.rename(bytename, new_bytename)

That code should not care what locale you are using because it uses
bytes for the file calls and is explicit about the encoding/decoding
steps.

| >| a) WHAT does it mean when a linux system is set to use utf-8?
| >
| >It means the locale settings _for the current process_ are set for
| >UTF-8. The "locale" command will show you the current state.
|
| That means that, when a linux application needs to save a filename
| to the linux filesystem, the app checks the filesystem's 'locale', so
| to encode the filename using the utf-8 charset ?

At the command line, many will not. They'll just read and write
bytes. Some will decode/encode. Those that do should by default use
the current locale.

But broadly, it is GUI apps that care about this because they must
translate byte sequences to glyphs: images of characters. So plenty
of command line tools do not need to care; the terminal application
is the one that presents the names to you; _it_ will decode them for
display. And it is the terminal app that translates your keystrokes
into bytes to feed to the command line.

NOTE: it is NOT the filesystem's locale. It is the current process'
locale, which is deduced from environment variables (which have
defaults if they are not set).
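
A rough way to see this from inside a process, as a sketch only: it
assumes Python 3, and the variable names are mine. It prints the
environment variables the locale is deduced from, the text encoding
Python reports for this process, and the raw names from a directory
listing.

```python
import locale
import os

# Variables consulted for the character-set part of the locale:
# LC_ALL overrides LC_CTYPE, which overrides LANG.
for var in ('LC_ALL', 'LC_CTYPE', 'LANG'):
    print('%s=%r' % (var, os.environ.get(var)))

# The text encoding Python reports for this process.
process_encoding = locale.getpreferredencoding(False)
print('process encoding:', process_encoding)

# Listing with a bytes path returns the names as stored: plain bytes.
raw_names = os.listdir(b'.')
print(raw_names[:5])
```

Run it with different LC_ALL settings and the reported encoding may
change, while the byte names stay the same.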

Under Windows I believe filesystems have locales; this can prevent
you storing some files on some filesystems on Windows, because the
filesystem doesn't cope. UNIX just takes bytes.

| And likewise when a linux application wants to decode a filename is
| also checking the filesystem's 'locale' setting so to know what
| charset must use to decode the filename correctly back to the
| original string?

Again, NOT the filesystem's locale. The process' locale. The
filesystem filenames are just bytes.

| So locale is used for filesystem itself and linux apps to know how
| to read(decode) and write(encode) filenames from/into the system's
| hdd?

NOT THE FILESYSTEM LOCALE. There is no filesystem locale.

If you look at:

  http://docs.python.org/3/library/sys.html#sys.getfilesystemencoding

you'll see it does not talk about a property of the filesystem, but
the behaviour that will be used when storing filenames.

| >| c) WHAT happens when the two of them try to work together?
| >
| >If everything matches, it is all good. If the locales do not match,
| >the mismatch will result in an undesired bytes<->characters
| >encode/decode step somewhere, and something will display incorrectly
| >or be entered as input incorrectly.
|
| Cant quite grasp the idea:
|
| local end: Win8, locale = greek-iso
| remote end: CentOS 6.4, locale = utf-8

What makes you think the remote end is utf-8? When you say "locale =
utf-8", _exactly_ what does that mean to you?

| FileZilla by default uses "do not know what charset" to upload filenames

Then at a guess it uploaded the filenames as greek-iso byte
sequences. The filenames on disc will be greek-iso byte sequences.

| Putty by default uses greek-iso to display filenames

Then it will look ok, superficially, I would expect.

| WHAT someone can expect to happen when all of the above work together?
| Mess of course, but i want to hear in detail each step of the mess
| as it emerges.

There are several steps, for example:

FileZilla will pass filenames to the remote end (FTP, SFTP, maybe)
as bytes. What those bytes will be will depend on FileZilla. The
UNIX end probably accepts them as-is and uses them directly. So the
filenames on disc would probably be greek-iso byte sequences.

Running a /bin/ls ("ls" without the alias, with no special options)
should present these byte sequences to the Terminal, which will
decode them using its locale (greek-iso?)

Running a "/bin/ls -b" (using the -b option from the ls alias) will
"print octal escapes for nongraphic characters". So "ls" must decide
what are nongraphic characters. It does this by decoding the
filenames using the _remote_ locale (its own locale). So it will
decode the greek-iso byte sequences as though they were utf-8.

Anything in the ASCII range (1-127, which will represent the same
characters in utf-8, iso-8859-1 or iso-8859-7), the boring Roman
alphabet range, will be treated the same. But outside that range the
byte sequence will be taken to mean different characters depending
on the locale. So "ls -b" will decide some of the greek-iso byte
sequences do not represent printable characters, and will decide to
print octal.

Experiment (a bare "utf-8" is not a locale name; use names that
"locale -a" lists on your host, for example):

  LC_ALL=C ls -b
  LC_ALL=el_GR.UTF-8 ls -b
  LC_ALL=el_GR.ISO-8859-7 ls -b

And the Terminal itself is decoding the output for display, and
encoding your input keystrokes to feed as input to the command line.

You would be best setting your Windows box to UTF-8, matching how
you intend to work on the remote UNIX host. I do not know what
ramifications that may have for your local filesystems or text
files.

Cheers,
-- 
Cameron Simpson <c...@zip.com.au>

Humans are incapable of securely storing high quality cryptographic
keys and they have unacceptable speed and accuracy when performing
cryptographic operations. (They are also large, expensive to
maintain, difficult to manage and they pollute the environment.)
It is astonishing that these devices continue to be manufactured
and deployed.
But they are sufficiently pervasive that we must design our
protocols around their limitations.
- C Kaufman, R Perlman, M Speciner
  _Network Security: PRIVATE Communication in a PUBLIC World_,
  Prentice Hall, 1995, pp. 205.
-- 
http://mail.python.org/mailman/listinfo/python-list