Tony Nelson wrote:
> BTW, Martin, if you care to, would you explain to me how a Trie would be
> used for charmap encoding?  I know a couple of approaches, but I don't
> know how to do it fast.  (I've never actually had the occasion to use
> a Trie.)
I currently envision a three-level trie, with 5, 4, and 7 bits. You take
the Unicode character (only characters up to U+FFFF supported), and use
the uppermost 5 bits as an index into an array. There you find the base
of a second array, to which you add the next 4 bits, which gives you an
index into a third array, where you add the last 7 bits. This gives you
the output byte, or 0 if the character is unmappable:

    struct encoding_map {
        unsigned char level0[32];  /* one slot per 2048 characters;
                                      0xFF = no level1 block */
        unsigned char *level1;     /* 16-byte blocks; 0xFF = no level2 block */
        unsigned char *level2;     /* 128-byte blocks; 0 = unmappable */
    };

    /* Return the output byte for 'ch', or -1 if it is unmappable. */
    static int
    encoding_map_lookup(const struct encoding_map *table, Py_UNICODE ch)
    {
        int level1 = table->level0[ch >> 11];              /* top 5 bits */
        if (level1 == 0xFF)
            return -1;
        int level2 = table->level1[16*level1 + ((ch >> 7) & 0xF)]; /* next 4 */
        if (level2 == 0xFF)
            return -1;
        int mapped = table->level2[128*level2 + (ch & 0x7F)];      /* last 7 */
        if (mapped == 0)
            return -1;
        return mapped;
    }

Over a hashtable, this has the advantage of not having to deal with
collisions. Instead, it guarantees a lookup in constant time. It is also
quite space-efficient: all tables use bytes as indices. As each level0
slot covers 2048 characters, most character maps will only use one or
two level1 blocks, meaning 16 or 32 bytes for level1. The number of
level2 blocks required depends on the number of 128-character rows which
the encoding spans; for most encodings, three or four such blocks will
be sufficient (with ASCII typically spanning one such block), keeping
the entire memory consumption for many encodings under 600 bytes.

It would be possible to remove one level of indirection (gaining some
speed) at the expense of additional memory: for example, an 8-8 trie
would use 256 bytes for level 0, and then 256 bytes for each Unicode row
where the encoding has characters, likely resulting in about 1 KiB for a
typical encoding (a sketch of such a lookup is in the P.S. below).

Hye-Shik Chang's version of the fast codec uses such an 8-8 trie, but
conserves space by observing that most 256-character rows are only
sparsely used by encodings (e.g. if you have only ASCII, you use only
128 characters from row 0). He therefore strips unused cells from each
row, by also recording the first and last character per row (also
sketched below). This brings some of the space back, at the expense of
some slow-down again.

Regards,
Martin
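P.S. For concreteness, here is a minimal sketch of the two-level 8-8
lookup mentioned above. This is my own illustration, not code from any
actual codec; the names (encoding_map88, map88_lookup) are made up:

    #include <Python.h>  /* for Py_UNICODE */

    struct encoding_map88 {
        unsigned char level0[256]; /* row -> level1 block; 0xFF = empty row */
        unsigned char *level1;     /* concatenated 256-byte blocks */
    };

    /* Return the output byte for 'ch', or -1 if it is unmappable. */
    static int
    map88_lookup(const struct encoding_map88 *table, Py_UNICODE ch)
    {
        int block = table->level0[ch >> 8];    /* top 8 bits select the row */
        if (block == 0xFF)
            return -1;                         /* no block: row is empty */
        int mapped = table->level1[256*block + (ch & 0xFF)];
        if (mapped == 0)
            return -1;                         /* 0 marks an unmapped cell */
        return mapped;
    }

One array indexing step fewer than the 5-4-7 trie, at the cost of 256
bytes per populated row.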
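The row-stripping variant might then look like this; again, the struct
layout is just my reconstruction of the idea from the description above,
and the real code may differ:

    struct stripped_row {
        unsigned char first, last; /* bounds of the used cells in this row */
        unsigned char *cells;      /* (last - first + 1) bytes */
    };

    struct stripped_map {
        struct stripped_row *rows[256];  /* NULL = empty row */
    };

    /* Return the output byte for 'ch', or -1 if it is unmappable. */
    static int
    stripped_lookup(const struct stripped_map *table, Py_UNICODE ch)
    {
        const struct stripped_row *row = table->rows[ch >> 8];
        unsigned char cell = (unsigned char)(ch & 0xFF);
        if (row == NULL || cell < row->first || cell > row->last)
            return -1;            /* empty row, or outside the stored range */
        int mapped = row->cells[cell - row->first];
        if (mapped == 0)
            return -1;            /* 0 marks a hole inside the range */
        return mapped;
    }

For an ASCII-only map, row 0 would store first=0, last=127 and keep 128
cells instead of 256; the extra bounds checks are where the slow-down
comes from.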