[issue6331] Add unicode script info to the unicode database

Martin v. Löwis Wed, 24 Jun 2009 12:32:12 -0700

Martin v. Löwis <mar...@v.loewis.de> added the comment:

>> I'm puzzled why you use a hard-coded list of script names. The set of
>> scripts will certainly change across Unicode versions, and I think it
>> would be better to learn the script names from Scripts.txt.
> 
> I hardcoded the list, because I saw no easy way to get the indexes
> consistent across both versions of the database.


Couldn't you have a global cache, something like

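scripts = ['Unknown']  # index 0 is reserved for "no script"
def findscript(script):
    # Return the index of a known script, or register a new one.
    try:
        return scripts.index(script)
    except ValueError:
        scripts.append(script)
        return len(scripts) - 1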
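
As a rough illustration of how such a cache could keep indexes consistent when the names are read from Scripts.txt: the sketch below parses two short samples written in the Scripts.txt line format. The samples, parse_scripts and the LINE pattern are assumptions made for this example; they are not taken from makeunicodedata.py.

```python
import re

scripts = ['Unknown']  # shared by every database version that gets parsed

def findscript(script):
    try:
        return scripts.index(script)
    except ValueError:
        scripts.append(script)
        return len(scripts) - 1

# Illustrative lines in Scripts.txt format (not verbatim UCD excerpts).
OLD_SAMPLE = """\
0041..005A    ; Latin # uppercase letters
0391..03A1    ; Greek # uppercase letters
"""
NEW_SAMPLE = """\
# a later version may list scripts in a different order
0400..0481    ; Cyrillic # letters
0041..005A    ; Latin # uppercase letters
0391..03A1    ; Greek # uppercase letters
"""

# "XXXX" or "XXXX..YYYY", then ";", then the script name.
LINE = re.compile(r'^([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s*(\w+)')

def parse_scripts(text):
    """Map each listed code point to its index in the global script list."""
    result = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if not m:
            continue  # comment or blank line
        first = int(m.group(1), 16)
        last = int(m.group(2) or m.group(1), 16)
        index = findscript(m.group(3))
        for cp in range(first, last + 1):
            result[cp] = index
    return result

old = parse_scripts(OLD_SAMPLE)
new = parse_scripts(NEW_SAMPLE)
```

Because both calls share the one list, 'Latin' gets the same index in both tables even though the second sample lists 'Cyrillic' first.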

>> Out of curiosity: how does the addition of the script property affect
>> the number of distinct database records, and the total size of the database?
> 
> I'm not exactly sure how to measure this, but the length of
> _PyUnicode_Database_Records goes from 229 entries to 690 entries.

I think this needs to be fixed, then - we need to study why there are
so many new records (e.g. what script contributes most new records),
and then look for alternatives.
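
One possible way to see which script contributes most new records is sketched below. It assumes two hypothetical inputs: a mapping from code point to the existing record, and a mapping from code point to script name. Strings stand in for the real record tuples, and the data is a toy example.

```python
from collections import Counter

def new_records_by_script(base_record, script):
    """Count, per script, the records that exist only because of the script.

    A (record, script) pair counts as "new" when its base record was
    already present with a different script. Attribution depends on
    iteration order: the first script seen with a record gets it for free.
    """
    seen_base = set()
    seen_pairs = set()
    extra = Counter()
    for cp in sorted(base_record):
        rec = base_record[cp]
        name = script.get(cp, 'Unknown')
        if (rec, name) in seen_pairs:
            continue
        seen_pairs.add((rec, name))
        if rec in seen_base:
            extra[name] += 1  # same record, different script: a duplicate
        else:
            seen_base.add(rec)
    return extra

# Toy data: 'Lu' and 'Ll' stand in for full database records.
base = {0x41: 'Lu', 0x61: 'Ll', 0x391: 'Lu', 0x3B1: 'Ll', 0x410: 'Lu'}
script = {0x41: 'Latin', 0x61: 'Latin', 0x391: 'Greek',
          0x3B1: 'Greek', 0x410: 'Cyrillic'}
extra = new_records_by_script(base, script)
```

The number of distinct base records plus the sum of the counts gives the size of the combined record table, so the counter shows where the growth comes from.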

One alternative could be to create a separate Trie for scripts.
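
A minimal sketch of what a separate structure for scripts might look like, using a two-stage lookup table as a simple stand-in for a trie. The block size SHIFT and the function names are assumptions made for this example, and the input length is assumed to be a multiple of the block size.

```python
SHIFT = 7  # 128 code points per block; an arbitrary choice for this sketch

def build_two_stage(values):
    """Split a flat table into (index1, index2), sharing identical blocks."""
    size = 1 << SHIFT
    index1, index2, blocks = [], [], {}
    for start in range(0, len(values), size):
        block = tuple(values[start:start + size])
        if block not in blocks:
            # Remember the block number of the first copy of this block.
            blocks[block] = len(index2) >> SHIFT
            index2.extend(block)
        index1.append(blocks[block])
    return index1, index2

def lookup(index1, index2, cp):
    """Return the value for code point cp from the two-stage table."""
    block = index1[cp >> SHIFT]
    return index2[(block << SHIFT) + (cp & ((1 << SHIFT) - 1))]

# Toy script table: 5 blocks, of which only 3 are distinct.
values = [1] * 256 + [0] * 128 + [2] * 128 + [0] * 128
index1, index2 = build_two_stage(values)
```

Long runs of equal scripts collapse into shared blocks, which is why the homogeneity of the script table matters for the final size.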

I'd also be curious if we can increase the homogeneity of scripts
(i.e. produce longer runs of equal scripts) if we declare that
unassigned code points have the script that corresponds to the block
(i.e. the script that surrounding characters have), and then only
change it to "Unknown" at lookup time if it's unassigned.
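
The idea can be sketched as follows, under a simplifying assumption: an unassigned code point inherits the script of the nearest preceding assigned code point, rather than a script derived from real block data. The table and the assigned flags are toy values.

```python
from itertools import groupby

UNKNOWN = 0

def count_runs(table):
    """Number of maximal runs of equal values."""
    return sum(1 for _ in groupby(table))

def fill_gaps(table, assigned):
    """Give each unassigned code point the script of the one before it."""
    filled, current = [], UNKNOWN
    for cp, script in enumerate(table):
        if assigned[cp]:
            current = script
        filled.append(current)
    return filled

def lookup(filled, assigned, cp):
    """Report Unknown for unassigned code points at lookup time."""
    return filled[cp] if assigned[cp] else UNKNOWN

# Toy data: 1 and 2 are script indexes, 0 marks unassigned code points.
table = [1, 1, 0, 1, 1, 0, 0, 2, 2]
assigned = [value != UNKNOWN for value in table]
filled = fill_gaps(table, assigned)
```

Here the stored table drops from five runs to two, while lookup still returns the original values.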

> If it's any help I can post the output of makeunicodedata.py.

I'd be interested in the output of "size unicodedata.so", and how it
changes. Perhaps the actual size increase isn't that bad.

>> a) two functions are provided: one with the original script names, and
>> one with the lower-case script names
> 
> Is this really necessary, if we only have one version of the database?

I don't know what this will be used for, but one application is
certainly regular expressions. So we need an efficient test whether
the character is in the expected script or not. It would be bad if
such a test had to do a .lower() on each lookup.
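
To illustrate the point, a sketch with two parallel name tables, where the lower-case names are computed once rather than on every lookup. The names script, script_lower and in_script, and the tiny per-character table, are hypothetical; they are not the API of the patch.

```python
SCRIPT_NAMES = ('Unknown', 'Latin', 'Greek')
# Built once, so no lookup ever has to call .lower().
SCRIPT_NAMES_LOWER = tuple(name.lower() for name in SCRIPT_NAMES)

# Toy stand-in for the per-code-point script index in the database.
_SCRIPT_INDEX = {ord('a'): 1, ord('\u03b1'): 2}

def script(ch):
    """Script name in its original spelling."""
    return SCRIPT_NAMES[_SCRIPT_INDEX.get(ord(ch), 0)]

def script_lower(ch):
    """Script name in lower case, read from the precomputed table."""
    return SCRIPT_NAMES_LOWER[_SCRIPT_INDEX.get(ord(ch), 0)]

def in_script(ch, expected_lower):
    """Test a character against an already lower-cased script name."""
    return script_lower(ch) == expected_lower
```

A regular expression engine could lower-case the script name once when the pattern is compiled and then call something like in_script for each character.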

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue6331>
_______________________________________