On 22/11/2010 14:25, Asaf Bartov wrote:
> Note that the bug exists in GNU/Linux as well -- it's just better hidden...
> :)
> UTF8 uses a _variable_ amount of bytes to encode a code point. Often a
> single byte is enough. But if your filename includes very special
> characters, such as an "em-dash" (–) or an IPA character such as *ʧ* --
> then the character would take up two bytes, and for some obscure characters
> it can be up to _four_ bytes.

I don't think there is any issue with UTF-8 on GNU/Linux, either in libzim or in Kiwix, including for file names containing an em-dash. I have tested it and it works.

The reason, I think, is that the kernel hands the char* string to the filesystem unchanged, and file names on ext3/4 are UTF-8 in practice.

On Windows, however, the char* cannot simply be interpreted as UTF-16; otherwise an ASCII-encoded path would not work. So I suppose the STL open() and co. (or the kernel) convert the string to UTF-16 before asking the filesystem. If you want to open a file whose name contains characters outside the ASCII charset, I suppose you have to use a special STL open() that accepts wchar and pass the path directly in UTF-16.

That is my theory.

> So French accents fit in one byte, but some other characters do not. If I
> had a ZIM file with such a character on GNU/Linux, the code would fail too.

It does not look like it :)

> We do need a portable solution. I don't know the right way to do it off
> the top of my head, so perhaps someone else on the list can offer advice.
> If no one can, I'm willing to figure it out myself.

Yes, that would be great. Tommi, you are the STL expert :)

Thanks for your feedback, Asaf.

Emmanuel
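
To make the discussion above concrete, here is a minimal sketch, not code taken from libzim or Kiwix. It assumes a C++17 standard library. The names utf8_length and open_utf8 are hypothetical helpers introduced only for illustration. The first shows how many bytes UTF-8 needs per code point: accented letters such as é (U+00E9) and ʧ (U+02A7) take two bytes, and the en dash and em dash (U+2013, U+2014) take three. The second shows one possible portable way to open a file from a UTF-8 encoded name, letting the library convert to the native path encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <string>

// Number of bytes UTF-8 needs to encode one Unicode code point.
std::size_t utf8_length(char32_t cp) {
    assert(cp <= 0x10FFFF);        // outside this range is not valid Unicode
    if (cp < 0x80)    return 1;    // ASCII
    if (cp < 0x800)   return 2;    // e.g. U+00E9, U+02A7
    if (cp < 0x10000) return 3;    // e.g. U+2013, U+2014
    return 4;                      // supplementary planes
}

// Hypothetical helper (an assumption, not a libzim function):
// open a file for reading, given its name as UTF-8 encoded bytes.
std::ifstream open_utf8(const std::string& utf8_name) {
    // u8path converts from UTF-8 to the platform's native path encoding
    // (wide characters on Windows, plain bytes on POSIX systems).
    // Note: u8path is deprecated in C++20 in favour of char8_t strings.
    return std::ifstream(std::filesystem::u8path(utf8_name), std::ios::binary);
}
```

With this approach the calling code can keep every path in UTF-8 as a plain std::string on all platforms, and the conversion to UTF-16 on Windows happens in one place.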
