Well, this was going to be a note on premature optimization, but as I wrote my way through it I realized it wasn't premature. Still, it's a mildly interesting saga.
As I've mentioned elsewhere, I'm trying to minimize dependency on external libraries so that we'll have all of the code in hand when we want to implement a self-hosted compiler. One library I'm avoiding is libicu, partly because it's enormous, and partly because some of the problems it solves are actually interesting and I'm learning about them.

This brings me to a brief digression concerning Unicode ranges. Unicode is a very large character set with *horrible* density properties. If you print out the table of all characters in the "Ll" category (lowercase letters), you'll discover that the ASCII subrange is dense, but that most other alphabets alternate uppercase and lowercase code points. The practical consequence of this is that there isn't any dense encoding for these tables unless you are willing to custom-design one (which, in this case, is worth it). For now, the right answer is to separate the basic plane from the others and use a 16-bit encoding within the basic plane.

Implementing all of this led me to spend some time implementing a UCSRangeSet class - a set of integers that may have densely packed subranges. That turns out to be an interesting challenge if you want to deal correctly with possibly overlapping or abutting entries and keep things dense. I tend to try to implement collection classes completely the first time so that I can steal them later for something else.

Because of the way std::set::lower_bound() is defined, the C++ std::set class can be abused to implement UCSRangeSet. Basically, you implement a range comparator having the property that overlapping ranges are non-comparable, and then you implement an insert routine that deals with the cases of overlap and possible merge. It's not so bad (a rough sketch appears below).

So I got that done and turned to implementing NFA nodes. Naturally I thought I should implement an NFA node as a set of (UCSRange => state) mappings. And then naturally I thought that I should do the work to keep those dense as well, because you really don't want to be iterating across 0x10FFFF character positions at every node. I built a lexer generator a long time ago, and even back when the algorithm only ran over 256 characters that was an expensive way to go.

But it turns out that maintaining a dense map is actually kind of complicated. For example, if you see the insertions

    [a-f] => 1
    [c-d] => 3

you have to split the first range to insert the second. If this particular case arises in NFA construction, something is wrong, but it's not at all uncommon to see

    [a-f] => 1
    c     => 1

because somebody managed to add the same character to a range twice.

But even if it's wrong for NFAs, that kind of insertion is something that a general-purpose DenseMap class would need to deal with. Except, of course, that there are a lot of corner cases. Overlap above. Overlap below. Containment in either direction. All of these with and without a matching destination (mapped) value - one case can sometimes be coalesced; the other cannot. It's all completely doable, but it's the kind of code that's hard to test and a nuisance to get right. Yet another example where reusable code is hard to write. (One way to organize it is sketched below.)

Enough so that my head started to hurt. Unfortunately it needs to be done, because somebody will hand us the character set [^a], which contains 1,114,111 elements, and I really don't want to spend 17 megabytes encoding those transitions.

And then you get to the fun part: building an iterator that operates over multiple range maps simultaneously...
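To make the std::set trick concrete, here is roughly the shape of it. The names (UCSRange, RangeLess, and so on) are placeholders rather than the real interface, and erase, iteration, and the rest are left out:

    #include <set>
    #include <cstdint>
    #include <algorithm>

    struct UCSRange {
      uint32_t lo, hi;                     // inclusive bounds, lo <= hi
    };

    // Under this comparator, two ranges that overlap are "equivalent".
    // It is only a valid strict weak ordering while the stored ranges
    // stay disjoint, which the insert routine below maintains.
    struct RangeLess {
      bool operator()(const UCSRange& a, const UCSRange& b) const {
        return a.hi < b.lo;                // a ends strictly before b begins
      }
    };

    class UCSRangeSet {
      std::set<UCSRange, RangeLess> ranges;   // disjoint, non-abutting

    public:
      bool contains(uint32_t cp) const {
        UCSRange probe = { cp, cp };
        return ranges.count(probe) != 0;
      }

      void insert(UCSRange r) {
        // Widen the probe by one on each side so that abutting entries
        // are found (and coalesced) along with overlapping ones.
        UCSRange probe = { r.lo == 0 ? r.lo : r.lo - 1,
                           r.hi == 0x10FFFF ? r.hi : r.hi + 1 };
        auto hits = ranges.equal_range(probe);
        for (auto it = hits.first; it != hits.second; ++it) {
          r.lo = std::min(r.lo, it->lo);   // grow r over everything it touches
          r.hi = std::max(r.hi, it->hi);
        }
        ranges.erase(hits.first, hits.second);
        ranges.insert(r);
      }
    };

The disjointness invariant is what makes lower_bound()/equal_range() do the right thing: they hand back exactly the stored entries that need to be merged into the incoming range.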
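As for the DenseMap corner cases: one way to tame them - not necessarily the shape the real class should take, and the method names and boundary representation here are mine - is to store only the points at which the mapped value changes, so that splits, containment, and coalescing all fall out of a single assign routine:

    #include <map>
    #include <cstdint>
    #include <iterator>

    // Sketch of a dense (code point -> state) map.  m[k] is the value in
    // effect from k up to (but not including) the next key; there is
    // always an entry at 0, and NO_STATE marks gaps with no transition.
    class DenseMap {
      static const int      NO_STATE   = -1;
      static const uint32_t DOMAIN_END = 0x110000;  // one past U+10FFFF

      std::map<uint32_t, int> m;

    public:
      DenseMap() { m[0] = NO_STATE; }

      int at(uint32_t cp) const {
        return std::prev(m.upper_bound(cp))->second;
      }

      // Map the closed range [lo, hi] to v, overwriting whatever was there.
      void assign(uint32_t lo, uint32_t hi, int v) {
        uint32_t end = hi + 1;                 // half-open upper bound
        if (end < DOMAIN_END) {
          int after = at(end);                 // old value just past the range
          m[end] = after;                      // pin it with an explicit boundary
        }
        m.erase(m.lower_bound(lo), m.lower_bound(end));
        m[lo] = v;

        // Coalesce with equal-valued neighbors on either side.
        auto it = m.find(lo);
        if (it != m.begin() && std::prev(it)->second == v)
          m.erase(it);
        it = m.find(end);
        if (it != m.end() && it->second == v)
          m.erase(it);
      }
    };

With that representation, inserting [a-f] => 1 and then [c-d] => 3 leaves the map reading [a-b] => 1, [c-d] => 3, [e-f] => 1 - the split happens by itself - while [a-f] => 1 followed by c => 1 collapses back to the single range. For actual NFA transitions you would want merge semantics (sets of target states) rather than overwrite, but the representation is the same.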
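For the iterator over multiple range maps, the obvious shape is a boundary sweep over the subset of NFA states being considered. The types here are invented for illustration, and a real version would walk the sorted transition lists in lock-step rather than re-scanning them for every piece, but the idea is the same:

    #include <algorithm>
    #include <cstdint>
    #include <set>
    #include <vector>

    // Invented flat representation: each NFA state owns its transitions
    // as (lo, hi, target) triples over closed code point ranges.
    struct Transition { uint32_t lo, hi; int target; };
    typedef std::vector<Transition> NfaState;

    struct Move {
      uint32_t lo, hi;          // maximal subrange with a uniform answer
      std::set<int> targets;    // NFA states reachable on that subrange
    };

    // Cut the code space at every range boundary occurring in the subset,
    // then collect the reachable targets for each piece.  The cost is
    // proportional to the number of ranges, not to 0x110000 characters.
    std::vector<Move> moves(const std::vector<const NfaState*>& subset) {
      std::vector<uint32_t> cuts;
      for (const NfaState* s : subset)
        for (const Transition& t : *s) {
          cuts.push_back(t.lo);
          cuts.push_back(t.hi + 1);            // half-open upper edge
        }
      std::sort(cuts.begin(), cuts.end());
      cuts.erase(std::unique(cuts.begin(), cuts.end()), cuts.end());

      std::vector<Move> out;
      for (size_t i = 0; i + 1 < cuts.size(); ++i) {
        Move piece;
        piece.lo = cuts[i];
        piece.hi = cuts[i + 1] - 1;            // back to a closed range
        // Every transition either covers this piece entirely or misses it,
        // because the cuts include all of its endpoints.
        for (const NfaState* s : subset)
          for (const Transition& t : *s)
            if (t.lo <= piece.lo && piece.hi <= t.hi)
              piece.targets.insert(t.target);
        if (!piece.targets.empty())
          out.push_back(piece);
      }
      return out;
    }

Pieces whose target sets come out identical can then be merged, and each distinct target set (after taking epsilon-closure) names a candidate DFA state - which is exactly the "which subranges do not end up in the same place" question below.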
In the NFA->DFA conversion, you need to iterate through a *set* of nodes in lock-step. Basically, you proceed in sequence to identify all of the overlapping subranges and see which ones do *not* end up in the same NFA state. If you do this by iterating over characters it's simple, but expensive. If you do this by iterating over ranges, it's complicated, but relatively more efficient.

shap
