On Monday 29 September 2008, M.-A. Lemburg wrote:
> On 2008-09-29 12:50, Ulrich Eckhardt wrote:
> > 1. For POSIX platforms (using a byte string for the path):
> > Here, the first approach is to convert the path to Unicode, according to
> > the locale's CTYPE category. Hopefully, it will be UTF-8, but also
> > codepages should work. If there is a segment (a byte sequence between two
> > path separators) where it doesn't work, it uses an ASCII mapping where
> > possible and codepoints from the "Private Use Area" (PUA) of Unicode for
> > the non-decodable bytes.
> > In order to pass this path to fopen(), each segment would be converted to
> > a byte string again, using the locale's CTYPE category except for
> > segments which use the PUA where it simply encodes the original bytes.
>
> I'm not sure how this would work. How would you map the private use
> code points back to bytes ? Using a special codec that knows about
> these code points ? How would the fopen() know to use that special
> codec instead of e.g. the UTF-8 codec ?

Sorry, I wasn't clear enough. I'll try to explain further...

Let's assume we have a filename like this:

  0xc2 0xa9 0x2f 0x7f

The first two bytes are the copyright sign encoded in UTF-8, followed by a
slash (0x2f, the path separator) and a character encoded in an unknown
codepage (0x7f is not printable ASCII!).

The first step when receiving that path from the system would be to split it
into segments. Here we would get two of them, one with 0xc2 0xa9 and the
other with 0x7f. This relies on the fact that the separator (slash/0x2f) is
rather universal. (Note: I'm not sure about encodings like BIG5, i.e. ones
that are neither UTF-8 nor derived from ASCII.)

For each segment, we would apply the locale's CTYPE facet and get the Unicode
codepoint 0xa9 for the first segment, while the second one fails to convert.
So, for the second one, we simply check for each byte whether it is valid and
printable ASCII (0x7f isn't). If it is, we emit the byte as a Unicode
codepoint. Otherwise, we map it to the PUA. The PUA reserves 0xe000 to 0xf8ff
for private uses. I would simply encode the byte 0x7f as 0xe07f, i.e. map it
to the beginning of that range. Eventually, we would end up with the
following Unicode codepoints:

  0xa9, 0x2f, 0xe07f

When converting to a byte string for use with fopen(), we simply inspect the
supplied string again. If a segment contains elements of the PUA, we simply
reverse the mapping for those and leave the others in that segment as-is. For
all other segments, we apply the CTYPE conversion.

Notes:
 * This effectively converts the current path representation (a string) into
   a sequence of segments where each segment can either be a fully
   Unicode-capable string or a raw byte string without any known
   interpretation. However, instead of using an array for that, it uses a
   string, which is what most people's code expects anyway.
 * You could also work on a per-byte basis instead of splitting the path into
   segments first.
   I just assumed that a single segment will not contain valid UTF-8
   sequences mixed with invalid ones. A path, however, can contain both
   correctly and incorrectly encoded segments.

> BTW: Private use areas in Unicode are meant for e.g. company specific
> code points. Using them for escaping purposes is likely to cause problems
> due to assignment clashes.

I'm not sure whether the use I proposed is correct according to the intended
use of the PUA. I know that ideally no such string would escape from Python,
i.e. it should only be visible internally. I would guess that that is
something the PUA was intended for.

Uli

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
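
As a rough illustration of the segment-wise scheme described in this message, here is a minimal Python 3 sketch. It is not an existing Python API: the names `decode_path`, `encode_path` and `PUA_BASE` are hypothetical, and an explicit `encoding` argument (defaulting to UTF-8) stands in for the locale's CTYPE conversion. The demo uses the byte 0xff rather than 0x7f, because 0x7f is the ASCII DEL control character and would decode without error under UTF-8.

```python
PUA_BASE = 0xE000  # assumption: byte b is escaped as code point 0xE000 + b


def _is_printable_ascii(byte):
    """True for the printable ASCII range 0x20..0x7E."""
    return 0x20 <= byte <= 0x7E


def _is_escape(ch):
    """True if ch lies in the 256 code points used for escaped bytes."""
    return PUA_BASE <= ord(ch) <= PUA_BASE + 0xFF


def decode_path(raw, encoding="utf-8"):
    """Decode a byte path segment by segment.

    Segments that fail to decode keep printable ASCII bytes as-is and
    map every other byte into the Private Use Area.
    """
    segments = []
    for seg in raw.split(b"/"):
        try:
            segments.append(seg.decode(encoding))
        except UnicodeDecodeError:
            segments.append("".join(
                chr(b) if _is_printable_ascii(b) else chr(PUA_BASE + b)
                for b in seg))
    return "/".join(segments)


def encode_path(path, encoding="utf-8"):
    """Reverse decode_path().

    Segments containing escape code points are turned back into raw
    bytes; all other segments are encoded normally.
    """
    segments = []
    for seg in path.split("/"):
        if any(_is_escape(ch) for ch in seg):
            # The remaining characters of such a segment are expected to
            # be printable ASCII; anything above 0xFF raises ValueError.
            segments.append(bytes(
                ord(ch) - PUA_BASE if _is_escape(ch) else ord(ch)
                for ch in seg))
        else:
            segments.append(seg.encode(encoding))
    return b"/".join(segments)


if __name__ == "__main__":
    raw = b"\xc2\xa9/\xff"          # copyright sign, slash, undecodable byte
    text = decode_path(raw)
    print([hex(ord(ch)) for ch in text])
    print(encode_path(text) == raw)  # round trip restores the original bytes
```

The sketch also makes the clash concern concrete: a segment that is valid UTF-8 but happens to contain a real code point in the range 0xe000 to 0xe0ff decodes normally, yet `encode_path` would then treat it as an escaped segment and write different bytes back. A real implementation would need some way to tell those two cases apart.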