Martin v. Löwis <mar...@v.loewis.de> added the comment:

>> I'm puzzled why you use a hard-coded list of script names. The set of
>> scripts will certainly change across Unicode versions, and I think it
>> would be better to learn the script names from Scripts.txt.
>
> I hardcoded the list, because I saw no easy way to get the indexes
> consistent across both versions of the database.

Couldn't you have a global cache, something like

  # Index 0 is reserved for 'Unknown'; each new script gets the next index.
  scripts = ['Unknown']
  def findscript(script):
      try:
          return scripts.index(script)
      except ValueError:
          scripts.append(script)
          return len(scripts)-1

>> Out of curiosity: how does the addition of the script property affect
>> the number of distinct database records, and the total size of the database?
>
> I'm not exactly sure how to measure this, but the length of
> _PyUnicode_Database_Records goes from 229 entries to 690 entries.

I think this needs to be fixed, then - we need to study why there are
so many new records (e.g. what script contributes most new records), and
then look for alternatives.

One alternative could be to create a separate Trie for scripts.

I'd also be curious if we can increase the homogeneity of scripts
(i.e. produce longer runs of equal scripts) if we declare that
unassigned code points have the script that corresponds to the block
(i.e. the script that surrounding characters have), and then only
change it to "Unknown" at lookup time if it's unassigned.

> If it's any help I can post the output of makeunicodedata.py.

I'd be interested in "size unicodedata.so", and how it changes.
Perhaps the actual size increase isn't that bad.

>> a) two functions are provided: one with the original script names, and
>> one with the lower-case script names
>
> Is this really necessary, if we only have one version of the database?

I don't know what this will be used for, but one application is
certainly regular expressions. So we need an efficient test whether the
character is in the expected script or not. It would be bad if such a
test had to do a .lower() on each lookup.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue6331>
_______________________________________
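
To illustrate the idea above about unassigned code points, here is a minimal sketch on toy data. Everything in it is hypothetical: the names count_runs, fill_unassigned and lookup, the eight-entry tables, and the simplification that "the script of the block" is approximated by the script of the nearest preceding assigned code point. It is not taken from makeunicodedata.py and only shows why the stored table could have fewer, longer runs while lookups still report "Unknown".

```python
from itertools import groupby

UNKNOWN = "Unknown"


def count_runs(scripts):
    """Count maximal runs of equal adjacent script values."""
    return sum(1 for _ in groupby(scripts))


def fill_unassigned(scripts, assigned):
    """Return a copy where each unassigned code point inherits the script
    of the nearest preceding assigned code point (an assumption standing
    in for "the script of the surrounding block")."""
    filled = []
    current = UNKNOWN
    for script, is_assigned in zip(scripts, assigned):
        if is_assigned:
            current = script
        filled.append(current)
    return filled


def lookup(filled, assigned, cp):
    """Report "Unknown" at lookup time for unassigned code points."""
    return filled[cp] if assigned[cp] else UNKNOWN


# Toy data, not real Unicode: eight consecutive code points.
assigned = [True, True, False, True, False, False, True, True]
scripts = ["Greek", "Greek", UNKNOWN, "Greek",
           UNKNOWN, UNKNOWN, "Coptic", "Coptic"]

filled = fill_unassigned(scripts, assigned)
print(count_runs(scripts), "runs as stored today")
print(count_runs(filled), "runs after filling the gaps")
```

In this toy table the lookup results are unchanged for every code point; only the stored values differ. Whether the real database shrinks the same way would have to be measured against the actual Scripts.txt data.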