Patches item #1552880, was opened at 2006-09-05 20:11
Message generated for change (Comment added) made by loewis
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=1552880&group_id=5470
Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.
Category: Core (C code)
Group: Python 2.6
Status: Open
Resolution: None
Priority: 5
Submitted By: Kristján Valur (krisvale)
Assigned to: Nobody/Anonymous (nobody)
Summary: Unicode Imports

Initial Comment:
This patch modifies the import mechanism to fully support unicode pathnames on Windows. It does this by first converting each member of sys.path to utf-8. Strings are encoded using the current locale. The whole of the import logic is then unchanged and works on the utf-8 strings as though they were regular ascii strings in the current locale. Only when file operations are done, such as stat() and open(), do we then convert from utf-8 back to unicode and use the Windows unicode APIs for the job. This is also done when initializing Module objects.

This approach has the benefit of having a low impact on the importing logic, and is thus easy to verify. There is however some overhead with the conversions.

At CCP Games we used this approach, backported to Python 2.3, to get unicode imports working for our game, EVE Online, thereby solving installation issues in the Far East.

This patch is submitted as demonstration code to the Python community. I would like to see unicode fully supported in 2.6.

Cheers,
Kristján

----------------------------------------------------------------------

>Comment By: Martin v. Löwis (loewis)
Date: 2006-09-12 22:17

Message:
Logged In: YES
user_id=21627

krisvale: indeed, option 4 is platform dependent. Notice that on Linux, the file system encoding won't necessarily be UTF-8. Instead, the value depends on the locale, so it may be latin-1, latin-9, gb2312, ... This makes it even more dependent on the platform, and even on the current user being logged in (such is life with locale-based approaches; the same is mostly true for Windows: "mbcs" can mean nearly anything).
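The platform and locale dependence described above can be observed directly. The following is a minimal sketch in modern Python 3, not the Python 2.6 under discussion and not part of the patch; `describe_encodings` is a hypothetical helper name introduced only for illustration.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings that influence how path names are interpreted.

    Hypothetical helper for illustration; not part of the patch.
    """
    return {
        # Encoding used to convert between Unicode file names and bytes.
        "filesystem": sys.getfilesystemencoding(),
        # Encoding implied by the current user's locale settings.
        "locale": locale.getpreferredencoding(False),
        # Platform identifier, e.g. "linux" or "win32".
        "platform": sys.platform,
    }


if __name__ == "__main__":
    for name, value in describe_encodings().items():
        print(f"{name}: {value}")
```

Running this under different locale settings or on different operating systems may report different values, which is the dependence the comment points out.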
Option 1) is Py3k-safe, where path names will always be Unicode strings. As you say, Unicode is a virulent type, so this approach would need a wide consensus.

I'm personally leaning towards option 2: it is nearly backwards compatible, except for obscure cases where people have mbcs-encodable entries in sys.path already, and it is independent of manipulations of the system encoding.

I also think that processing of PYTHONPATH should take Unicode into account, i.e. we should use _wgetenv to access PYTHONPATH in 2.6. That would make the feature truly useful, as then people could actually set sys.path to non-mbcs directories from the outside. Notice that W9x support can be dropped in 2.6, so a W9x-compatible solution won't be required.

In any case, I'd like to encourage you to continue working on this issue. I, too, would like to see it in 2.6, but I have wanted that ever since 2.1 or so (before PEP 277 was implemented), and it was wishful thinking. Somebody has to take action, and it is likely that it won't be one of the past regular contributors (or else they would have contributed it long ago, although I think Thomas Heller had something working at one point).

----------------------------------------------------------------------

Comment By: Anthony Baxter (anthonybaxter)
Date: 2006-09-12 13:29

Message:
Logged In: YES
user_id=29957

There's a variety of modules in the standard library that reference __file__ - if it's potentially going to be a unicode string, these are going to need to be checked, as are their callers :-/

(Now that I've looked closer at some of the issues, I'm extremely glad this didn't go into 2.5 final at this late stage)

----------------------------------------------------------------------

Comment By: Kristján Valur (krisvale)
Date: 2006-09-12 11:38

Message:
Logged In: YES
user_id=1262199

I submitted this mostly as a demonstration.
I don't think the approach is necessarily suitable for a final implementation because of the use of utf-8 as an intermediate representation and the price of the conversions that keep happening. But perhaps this is the way to go, if we consider utf-8 to be a stage-1 default file system encoding for win32.

I also agree that 4 is probably the most sensible approach. What about discrepancies between e.g. Linux and Windows then, when importing from a non-trivial path? On Linux we would get utf-8, on Windows unicode?

1) would actually make a lot of sense, only in my experience this tends to lead to a kind of unicode-hell, since a program touched by one unicode object tends to have it percolating down into every corner.

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2006-09-09 14:31

Message:
Logged In: YES
user_id=21627

First: Do you want to continue to work on this, or do you consider this just "demonstration code" (i.e. not contributed for inclusion in Python), hoping that somebody else implements this feature?

I think the behavior of __file__ must be more consistent across platforms, and the selected behavior must be documented somewhere. Several definitions of "consistent behavior" come to mind:
1. __file__ is always a Unicode string
2. __file__ is a byte string if it's ASCII, else Unicode
3. __file__ is a byte string if it's in the system encoding, else Unicode
4. __file__ is a byte string if it's in the file system encoding, else Unicode.

The documentation needs to be updated in several places, e.g. also for inspect.getfile. I would expect that pydoc would also need to be updated.

Selecting from the options above: I believe 4 is most compatible with previous versions; 1 and 2 are most convenient to work with in applications like pydoc which have to generate HTML (1 is easier to work with, 2 is more compatible with previous versions).
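The four candidate definitions above can be compared side by side with a small sketch. It is written for modern Python 3, where `bytes` stands in for the Python 2 byte string and `str` for the Python 2 unicode type; it is not code from the patch. The function name `file_attribute` and the parameters `system_encoding` and `fs_encoding` are hypothetical, and the "system encoding" is passed in explicitly as an assumption because Python 3 no longer has a configurable default encoding.

```python
import sys


def file_attribute(path, policy, system_encoding="ascii", fs_encoding=None):
    """Return the value __file__ would take for `path` under options 1-4.

    `path` is a text (Unicode) string. A bytes result models a Python 2
    byte string; a str result models a Python 2 unicode object.
    Hypothetical illustration only, not the patch's implementation.
    """
    if policy == 1:
        return path  # option 1: always Unicode
    if policy == 2:
        encoding = "ascii"  # option 2: byte string only if pure ASCII
    elif policy == 3:
        encoding = system_encoding  # option 3: the system encoding
    elif policy == 4:
        # option 4: the file system encoding (platform/locale dependent)
        encoding = fs_encoding or sys.getfilesystemencoding()
    else:
        raise ValueError("policy must be 1, 2, 3 or 4")
    try:
        return path.encode(encoding)  # byte string when encodable
    except UnicodeEncodeError:
        return path  # otherwise fall back to Unicode
```

Under option 2, an all-ASCII path such as `"lib/mod.py"` comes back as a byte string, while a path containing a non-ASCII character stays Unicode. Under options 3 and 4 the same path can yield either type depending on the encoding in effect, which is the platform dependence raised elsewhere in the thread.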
----------------------------------------------------------------------

Comment By: Kristján Valur (krisvale)
Date: 2006-09-09 13:38

Message:
Logged In: YES
user_id=1262199

From the top of my head, it is now unicode. I considered trying to convert it back to the default encoding, but decided not to, to keep the patch brief.

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2006-09-08 23:03

Message:
Logged In: YES
user_id=21627

What is the value of the __file__ attribute of a module when this patch is used?

----------------------------------------------------------------------
_______________________________________________
Patches mailing list
Patches@python.org
http://mail.python.org/mailman/listinfo/patches