Daniel Hillier <daniel.hill...@gmail.com> added the comment:

Looking into this more, it appears that while Appendix D of 
https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT says "If general 
purpose bit 11 is unset, the file name and comment SHOULD conform to the 
original ZIP character encoding", where the original encoding is IBM 437 
(cp437), this is not always followed. This isn't too surprising, as cp437 
doesn't have every character for every language! In particular, some archive 
programs on Windows will use the user's locale code page.

https://superuser.com/questions/1321371/proper-encoding-for-file-names-in-zip-archives-created-in-windows-and-unpacked-i
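
To illustrate the mismatch, here is a minimal sketch of the decoding rule described above. The helper `decode_name` and its `fallback` parameter are hypothetical names for illustration, not zipfile API, and cp866 is just an example of a locale code page an archiver might have used.

```python
# General purpose bit 11 (0x800) marks the name and comment as UTF-8.
UTF8_FLAG = 0x800


def decode_name(raw: bytes, flag_bits: int, fallback: str = "cp437") -> str:
    """Decode a raw header filename: UTF-8 if bit 11 is set, else `fallback`."""
    if flag_bits & UTF8_FLAG:
        return raw.decode("utf-8")
    return raw.decode(fallback)


# A name written by an archiver that used a locale code page (cp866 here)
# and left bit 11 unset.
raw = "привет.txt".encode("cp866")

as_cp437 = decode_name(raw, 0)                     # what the spec suggests
as_locale = decode_name(raw, 0, fallback="cp866")  # what the writer meant
```

Decoding as cp437 never fails, because every byte value maps to some character, so the wrong name is produced silently rather than raising an error.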

A UTF-8 filename can be stored in the extra field 0x7075 in addition to a 
filename encoded in an arbitrary code page stored in the header's filename 
section. There is an open issue to add handling of these fields (for reading) 
to zipfile: https://bugs.python.org/issue41928 and that issue may be related to 
this one https://bugs.python.org/issue40407
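
For reference, reading that field could look roughly like the sketch below. It assumes the layout given in APPNOTE.TXT for the Info-ZIP Unicode Path Extra Field: a version byte (1), the CRC-32 of the header's filename bytes, then the UTF-8 name. The function name `unicode_path_from_extra` is hypothetical and is not part of zipfile.

```python
import struct
import zlib

UNICODE_PATH_ID = 0x7075  # Info-ZIP Unicode Path Extra Field


def unicode_path_from_extra(extra: bytes, raw_name: bytes):
    """Return the UTF-8 name stored in a 0x7075 field, or None if absent/stale."""
    pos = 0
    while pos + 4 <= len(extra):
        # Each extra field starts with a 2-byte ID and a 2-byte data size.
        header_id, size = struct.unpack_from("<HH", extra, pos)
        data = extra[pos + 4 : pos + 4 + size]
        pos += 4 + size
        if header_id != UNICODE_PATH_ID or len(data) < 5:
            continue
        version, name_crc = struct.unpack_from("<BL", data)
        # The CRC ties the field to the header filename; if the header name
        # was changed by a tool unaware of this field, the field is ignored.
        if version == 1 and name_crc == zlib.crc32(raw_name):
            return data[5:].decode("utf-8")
    return None
```

A reader would pass `ZipInfo.extra` together with the undecoded filename bytes from the header, and fall back to the code page decoding when `None` comes back.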

For this issue, with regard to encoding, I prefer the current situation where 
general purpose bit 11 for UTF-8 is preferentially used because it doesn't 
change the behaviour compared to previous Python versions and it reduces file 
size as the filename isn't repeated in the extra field.

For compatibility with other archive programs that don't support the general 
purpose bit 11, I suggest we add an additional mechanism to allow the code page 
for the path name (and comment) to be set and use the 0x7075 extra field to 
store the UTF-8 name in those cases where the filename can't be encoded in ASCII 
(and 0x6375 to store the UTF-8 comment where it can't be encoded in ASCII).
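
A rough sketch of what the writing side of this suggestion might look like, for the path name only. The function `encode_entry_name` and its `codepage` parameter are hypothetical; zipfile has no such option today, and the field layout is assumed from APPNOTE.TXT.

```python
import struct
import zlib

UNICODE_PATH_ID = 0x7075  # Info-ZIP Unicode Path Extra Field


def encode_entry_name(name: str, codepage: str):
    """Return (header filename bytes, extra field bytes) for one entry.

    The header name is encoded in `codepage`. The UTF-8 copy is added
    only when the name is not pure ASCII, to avoid repeating it needlessly.
    """
    raw_name = name.encode(codepage)
    if name.isascii():
        return raw_name, b""
    # Version 1, CRC-32 of the header filename bytes, then the UTF-8 name.
    payload = struct.pack("<BL", 1, zlib.crc32(raw_name)) + name.encode("utf-8")
    return raw_name, struct.pack("<HH", UNICODE_PATH_ID, len(payload)) + payload


raw_name, extra = encode_entry_name("привет.txt", "cp866")
```

With this shape, bit 11 would stay unset for such entries, so older tools read the code page name from the header while newer ones can prefer the extra field.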

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue40172>
_______________________________________