Martin v. Löwis wrote:
> Neal Norwitz wrote:
>> See http://python.org/sf/1454485 for the gory details.  Basically if
>> you create a unicode array (array.array('u')) and try to append an
>> 8-bit string (ie, not unicode), you can crash the interpreter.
>>
>> The problem is that the string is converted without question to a
>> unicode buffer.  Within unicode, it assumes the data to be valid, but
>> this isn't necessarily the case.  We wind up accessing an array with a
>> negative index and boom.
>
> There are several problems combined here, which might need discussion:
>
> - why does the 'u#' converter use the buffer interface if available?
>   it should just support Unicode objects. The buffer object makes
>   no promise that the buffer actually is meaningful UCS-2/UCS-4, so
>   u# shouldn't guess that it is.
>   (FWIW, it currently truncates the buffer size to the next-smaller
>   multiple of sizeof(Py_UNICODE), and silently so)
>
>   I think that part should just go: u# should be restricted to unicode
>   objects.
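To make the failure mode in the quoted text concrete, here is a minimal Python sketch of what happens when raw 8-bit data is reinterpreted as fixed-width code units. It assumes a 2-byte little-endian unit purely for illustration (the real sizeof(Py_UNICODE) depends on the build), and reinterpret_as_units is a hypothetical helper, not anything taken from the CPython sources.

```python
import struct

# Assumed code-unit width in bytes; stands in for sizeof(Py_UNICODE).
UNIT_SIZE = 2


def reinterpret_as_units(raw, signed):
    """Reinterpret raw bytes as little-endian 16-bit code units.

    Trailing bytes that do not fill a whole unit are silently dropped,
    mirroring the truncation to the next-smaller multiple of the unit size.
    """
    usable = len(raw) // UNIT_SIZE * UNIT_SIZE  # next-smaller multiple
    count = usable // UNIT_SIZE
    fmt = "<%d%s" % (count, "h" if signed else "H")
    return list(struct.unpack(fmt, raw[:usable]))


# Five bytes: two whole units plus one stray byte that gets dropped.
raw = b"\xff\xffabc"

as_unsigned = reinterpret_as_units(raw, signed=False)
as_signed = reinterpret_as_units(raw, signed=True)
```

The same two bytes come out as a large but non-negative value when the unit type is unsigned, and as a negative value when it is signed. A negative value of that kind is what can end up being used as an array index.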
'u#' is intended to match 's#', which also uses the buffer interface.
It expects the buffer returned by the object to be a Py_UNICODE*
buffer, hence the calculation of the length.

However, we already have 'es#', which is a lot safer to use in this
respect: you can explicitly define the encoding you want to see,
e.g. 'unicode-internal', and the associated codec also takes care of
range checks, etc.

So, I'm +1 on restricting 'u#' to Unicode objects.

> - should Python guarantee that all characters in a Unicode object
>   are between 0 and sys.maxunicode? Currently, it is possible to
>   create Unicode strings with either negative or very large Py_UNICODE
>   elements.
>
> - if the answer to the last question is no (i.e. if it is intentional
>   that a unicode object can contain arbitrary Py_UNICODE values): should
>   Python then guarantee that Py_UNICODE is an unsigned type?

Py_UNICODE must always be unsigned. The whole implementation relies
on this and has been designed with this in mind (see PEP 100).
AFAICT, configure does check that Py_UNICODE is always unsigned.

Regarding the permitted range of values, I think the overhead needed
to check that all Py_UNICODE* array values are within the currently
permitted range would unnecessarily slow down the implementation.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Mar 31 2006)
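As a rough illustration of the range checking a codec performs (the property that makes 'es#' the safer choice above), the Python sketch below decodes raw bytes through an explicit codec instead of trusting them as code points. It uses 'utf-32-le' as a stand-in for 'unicode-internal', an assumption made so that the example runs on a current Python 3; decode_checked is a hypothetical helper.

```python
def decode_checked(raw, encoding="utf-32-le"):
    """Decode raw bytes with an explicit codec.

    Returns the decoded string, or None if the codec rejects the data
    (for example because a value lies outside the valid code point range).
    """
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# Four bytes that form one valid code point (U+0041).
valid = decode_checked(b"A\x00\x00\x00")

# Four bytes whose value is far above the largest valid code point.
invalid = decode_checked(b"\xff\xff\xff\xff")
```

The codec rejects the out-of-range input instead of passing it through, which is the check a raw buffer reinterpretation skips.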