On Wednesday, August 24, 2011 at 20:52:51, Glenn Linderman wrote:
> Given the required variability of character size in all presently
> Unicode defined encodings, I tend to agree with Tom that UTF-8,
> together with some technique of translating character index to code
> unit offset, may provide the best overall space utilization, and
> adequate CPU efficiency.
UTF-8 can use more space than latin1 or UCS-2:

>>> text="abc"; len(text.encode("latin1")), len(text.encode("utf8"))
(3, 3)
>>> text="ééé"; len(text.encode("latin1")), len(text.encode("utf8"))
(3, 6)
>>> text="€€€"; len(text.encode("utf-16-le")), len(text.encode("utf8"))
(6, 9)
>>> text="北京"; len(text.encode("utf-16-le")), len(text.encode("utf8"))
(4, 6)

UTF-8 uses less space than PEP 393 only if you have few non-ASCII
characters (or few non-BMP characters). As for speed, I would guess
that O(n) indexing (UTF-8) is slower than O(1) indexing (PEP 393).

> ... Applications that support long
> strings are more likely to be bitten by the occasional "outlier"
> character that is longer than the average character, doubling or
> quadrupling the space needed to represent such strings, and
> eliminating a significant portion of the space savings the PEP is
> providing for other applications.

In these worst cases, PEP 393 is no worse than the current
implementation: it uses just as much memory as Python in wide mode (the
mode used on Linux and Mac OS X, because wchar_t is 32 bits there), but
it does use twice the memory of Python in narrow mode (Windows). I
agree that UTF-8 is better in these corner cases, but I also bet that
most Python programs will use less memory and run faster with PEP 393.
You can already try the pep-393 branch on your own programs.

> Benchmarks may or may not fully reflect the actual requirements of
> all applications, so conclusions based on benchmarking can easily be
> blind-sided by the realities of other applications, unless the
> benchmarks are carefully constructed.

I used stringbench and "./python -m test test_unicode". I plan to try
iobench. Which other benchmark tools should be used? Should we write a
new one?

> It is possible that the ideas in PEP 393, with its support for
> multiple underlying representations, could be the basis for some more
> complex representations that would better support characters rather
> than only supporting code points, ...

I don't think that the *default* Unicode type is the best place for
this. The base Unicode type has to be *very* efficient. If you have
unusual needs, write your own type, maybe based on the base type? I've
appended a few rough sketches below to illustrate these points.

Victor
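
PS 1: Here is a rough Python model of how PEP 393 chooses the width of
its internal array: the bytes per code point depend only on the largest
code point in the string. The helper name is mine, it only counts the
character payload (not the object header), and it assumes Python 3
semantics (one code point per item, no surrogate pairs):

def pep393_width(text):
    """Bytes per code point a PEP 393 string would use (model only)."""
    if not text:
        return 1
    max_cp = max(ord(ch) for ch in text)
    if max_cp < 0x100:
        return 1    # latin1 range: 1 byte per code point
    elif max_cp < 0x10000:
        return 2    # BMP: 2 bytes per code point (UCS-2 storage)
    else:
        return 4    # non-BMP: 4 bytes per code point (UCS-4 storage)

for sample in ["abc", "ééé", "€€€", "北京", "\U0001D11E"]:
    payload = pep393_width(sample) * len(sample)
    print(repr(sample), payload, len(sample.encode("utf8")))

For "\U0001D11E" alone the model gives 4 bytes against 4 bytes for
UTF-8; the "quadrupling" only bites when one such outlier forces a
long, otherwise-ASCII string into the 4-byte representation.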
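
PS 2: To make the O(n) versus O(1) point concrete: in UTF-8 you cannot
jump straight to code point i, you have to scan from the start and skip
continuation bytes. A minimal sketch (my own helper, nothing like it in
the stdlib); with PEP 393's fixed-width array the same lookup is just
i * width:

def utf8_offset(data, index):
    """Byte offset of code point `index` in UTF-8 bytes `data`: O(n)."""
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; anything else
        # starts a new code point.
        if byte & 0xC0 != 0x80:
            if seen == index:
                return offset
            seen += 1
    raise IndexError("index out of range")

data = "aé€\U0001D11E".encode("utf8")
print(utf8_offset(data, 3))   # -> 6 (1 + 2 + 3 bytes come before it)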
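
PS 3: Until iobench results are in, a quick ad-hoc check is easy to
write with timeit. This is just a generic microbenchmark sketch of
random access into a large string (the operation where O(n) and O(1)
indexing differ the most), not one of the official benchmarks:

import timeit

setup = "text = 'é' * 1000000"
stmt = "text[500000]"
print(timeit.timeit(stmt, setup=setup, number=1000000))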
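
PS 4: By "write your own type" I mean something like a thin wrapper
over str that indexes by user-perceived character instead of by code
point. A naive sketch (the class is hypothetical; it glues combining
marks to their base with unicodedata.combining(), which is only an
approximation of real grapheme clusters, see UAX #29):

import unicodedata

class GraphemeString(str):
    """A str indexed by (approximate) user-perceived characters."""

    def clusters(self):
        cluster = ""
        for ch in self:
            if cluster and unicodedata.combining(ch):
                cluster += ch    # combining mark: extend current cluster
            else:
                if cluster:
                    yield cluster
                cluster = ch     # non-combining: start a new cluster
        if cluster:
            yield cluster

    def char_at(self, index):
        # O(n) lookup: exactly the indexing cost discussed above.
        for i, cluster in enumerate(self.clusters()):
            if i == index:
                return cluster
        raise IndexError("character index out of range")

s = GraphemeString("e\u0301tude")   # "é" spelled as e + combining acute
print(len(s))         # 6 code points
print(s.char_at(0))   # "é": one user-perceived character, two code points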