On 03/06/2011 02:24 PM, Stephan Hennig wrote:
Let me put it this way: In LuaTeX, the top-level data structure to organize the patterns of a language is still a trie, with additional data attached (fall-back transitions) that make the trie behave like a finite state automaton. Would that be wrong?
It depends on how you see the word 'trie': there is nowhere in luatex a data structure that looks like an actual trie. The algorithm of course uses a prefix test as the first step, but it checks for a key in a hash instead of walking a tree. I am not too good in explaining this, but you can look at luatexdir/lang/hyphen.w for the actual details.
BTW, am I correctly assuming that the reason for switching from a packed array data structure to a hash is that the former only works as long as the keys (characters) are from a finite range. Which is the case for a 7 or 8 bit character encoding. But with Unicode characters this is (practically) not the case?
Mostly. A secondary reason is that the original routines converted the trie into a 'compressed trie', after which point it became immutable. We did not like that, it is quite handy to be able to add new \patterns (for other languages) on the fly, and changing the existing code would have been quite a lot of work. A+B meant that the hash-based approach from LibHNJ was much easier to incorporate. Best wishes, Taco
