Well, this was going to be a note on premature optimization, but as I wrote
my way through it I realized it wasn't premature. Still, it's a mildly
interesting saga.


As I've mentioned elsewhere, I'm trying to minimize dependency on external
libraries so that we'll have all of the code in hand when we want to
implement a self-hosted compiler. One library I'm avoiding is libicu.
Partly because it's enormous, and partly because some of the problems it
solves are actually interesting and I'm learning about them. This brings me
to a brief digression concerning Unicode ranges.

Unicode is a very large character set with *horrible* density properties.
If you print out the table of all characters in the "Ll" category
(lowercase letters), you'll discover that the ASCII subrange is dense, but
that most other alphabetic ranges alternate uppercase with lowercase code
points. The practical consequence of this is that there isn't any dense
encoding for these tables unless you are willing to custom-design one
(which, in this case, is worth it). For now, the right answer is to
separate the basic multilingual plane from the others and use a 16-bit
encoding within it. Implementing all of
this led me to spend some time implementing a UCSRangeSet class - a set of
integers that may have densely packed subranges. That turns out to be an
interesting challenge if you want to deal correctly with possibly
overlapping or abutting entries and keep things dense. I tend to try to
implement collection classes completely the first time so that I can steal
them later for something else.
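
To make that concrete, here's roughly the shape of a single range (the
names are mine, purely for illustration). The rule that makes insertion
interesting is that two ranges need to be merged not only when they
overlap but also when they abut - [a-c] next to [d-f] should collapse
into [a-f]:

  #include <cstdint>
  #include <algorithm>

  struct UCSRange {
    uint32_t lo;   // inclusive lower code point
    uint32_t hi;   // inclusive upper code point

    // True if the two ranges overlap or sit immediately next to each
    // other.  Code points top out at 0x10FFFF, so hi+1 cannot overflow.
    bool touches(const UCSRange& other) const {
      return lo <= other.hi + 1 && other.lo <= hi + 1;
    }

    // Smallest range covering both.
    UCSRange merge(const UCSRange& other) const {
      return UCSRange{ std::min(lo, other.lo), std::max(hi, other.hi) };
    }
  };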

Because of the way std::set::lower_bound() is defined, the C++ std::set
class can be abused to implement UCSRangeSet. Basically, you define a
range comparator with the property that overlapping ranges are
non-comparable - they compare as equivalent - and then you write an insert
routine that handles the overlap and merge cases. It's not so bad.
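
A sketch of the trick (illustrative, not the actual code): the comparator
says one range is less than another only when it lies entirely below it,
so overlapping ranges come out equivalent, and equal_range() on a probe
hands back every stored entry the probe touches. This leans on the
invariant that the stored ranges stay disjoint - which is exactly what the
insert routine maintains, and why calling it abuse is fair:

  #include <set>

  struct UCSRangeLess {
    // "Less" only when entirely below; overlapping ranges are
    // therefore equivalent, which is what lets equal_range() find
    // every stored entry that intersects the probe.
    bool operator()(const UCSRange& a, const UCSRange& b) const {
      return a.hi < b.lo;
    }
  };

  class UCSRangeSet {
    std::set<UCSRange, UCSRangeLess> ranges;
  public:
    void insert(UCSRange r) {
      // Probe one code point wider on each side so that abutting
      // entries are found and merged too.
      UCSRange probe{ r.lo ? r.lo - 1 : 0, r.hi + 1 };
      auto span = ranges.equal_range(probe);
      for (auto it = span.first; it != span.second; it = ranges.erase(it))
        r = r.merge(*it);
      ranges.insert(r);
    }
  };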


So I got that done and turned to implementing NFA nodes. Naturally I
thought I should represent each node's transitions as a set of
(UCSRange => state) mappings. And then naturally I thought that I should
do the work to keep those dense
as well, because you really don't want to be iterating across 0x10FFFF
character positions at every node. I built a lexer generator a long time
ago, and even back when the algorithm only ran over 256 characters that was
an expensive way to go. But it turns out that maintaining a dense map is
actually kind of complicated. For example, if you see insertions:

   [a-f] => 1
   [c-d] => 3

you have to split the first range to insert the second. If this particular
case arises in NFA construction something is wrong, but it's not at all
uncommon to see

   [a-f] => 1
   c => 1

because somebody managed to add the same character to a range twice. And
even when such an insertion would be wrong for an NFA, it's the kind of
thing a general-purpose DenseMap class needs to handle. Except, of course,
that there are a lot of corner cases. Overlap above. Overlap below.
Containment in either direction. All of these with and without a matching
destination (mapped) value - one case can sometimes be coalesced; the other
cannot. It's all completely doable, but it's the kind of code that's hard
to test and a nuisance to get right. Yet another example where reusable
code is hard to write. Enough so that my head started to hurt.
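
For what it's worth, here's one way the insert path can be arranged (a
sketch with made-up names, not the real thing). The policy is that the new
mapping wins wherever it overlaps existing entries; entries carrying the
same target get coalesced, entries carrying a different target get trimmed
or split around the new range:

  #include <map>
  #include <vector>
  #include <utility>

  template <typename State>
  class DenseRangeMap {
    std::map<UCSRange, State, UCSRangeLess> entries;
  public:
    void insert(UCSRange r, State v) {
      // Probe one wider than r so that abutting same-valued entries
      // are coalesced as well.
      UCSRange probe{ r.lo ? r.lo - 1 : 0, r.hi + 1 };
      auto span = entries.equal_range(probe);
      std::vector<std::pair<UCSRange, State>> keep;
      for (auto it = span.first; it != span.second; it = entries.erase(it)) {
        const UCSRange e = it->first;
        if (it->second == v) {
          r = r.merge(e);               // same target: coalesce
        } else {                        // different target: trim or split
          if (e.lo < r.lo)
            keep.push_back({ UCSRange{ e.lo, r.lo - 1 }, it->second });
          if (e.hi > r.hi)
            keep.push_back({ UCSRange{ r.hi + 1, e.hi }, it->second });
        }
      }
      entries.insert({ r, v });
      for (const auto& p : keep) entries.insert(p);
    }
  };

Fed the two example insertions above, [a-f] => 1 followed by [c-d] => 3
comes out as [a-b] => 1, [c-d] => 3, [e-f] => 1, while [a-f] => 1 followed
by c => 1 just collapses back to [a-f] => 1.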

Unfortunately it needs to be done, because somebody will hand us the
character set [^a], which contains 1,114,111 elements, and I really don't
want to spend 17 megabytes encoding those transitions.
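
The saving grace is that the complement of a small set is itself only a
handful of ranges, so those elements never need to be materialized one at
a time. A rough sketch, assuming the ranges are kept sorted and disjoint:

  #include <vector>

  // Complement over the full code space [0, 0x10FFFF].  [^a] comes out
  // as just two ranges: [0, 'a'-1] and ['a'+1, 0x10FFFF].
  std::vector<UCSRange> complement(const std::vector<UCSRange>& in) {
    std::vector<UCSRange> out;
    uint32_t next = 0;
    for (size_t i = 0; i < in.size(); i++) {
      if (in[i].lo > next)
        out.push_back(UCSRange{ next, in[i].lo - 1 });
      next = in[i].hi + 1;
    }
    if (next <= 0x10FFFF)
      out.push_back(UCSRange{ next, 0x10FFFF });
    return out;
  }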

And then you get to the fun part: building an iterator that operates over
multiple range maps simultaneously...

In the NFA->DFA conversion, you need to iterate through a *set* of nodes in
lock-step. Basically, you walk the transition ranges of all the nodes in
order, identify the overlapping subranges, and see which ones do *not* end
up in the same set of target NFA states. If you do this by iterating over
characters it's simple, but expensive. If you do it by iterating over
ranges, it's complicated, but far more efficient.
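
Here's roughly how I'd sketch the range-wise version (illustrative only,
and it glosses over epsilon closure and the rest of the subset
construction): gather every range boundary from every node in the current
set, then walk the elementary intervals between consecutive boundaries.
Within one such interval every character behaves identically, so each
interval only has to be examined once:

  #include <map>
  #include <set>
  #include <vector>
  #include <utility>

  typedef std::map<UCSRange, int, UCSRangeLess> TransMap;  // range -> state

  // For a set of NFA nodes (one TransMap each), produce the elementary
  // intervals and, for each, the set of states reachable on it.
  std::vector<std::pair<UCSRange, std::set<int>>>
  elementaryIntervals(const std::vector<const TransMap*>& nodes) {
    // 1. Every lo, and every hi+1, is a potential "cut" point.
    std::set<uint32_t> cuts;
    for (const TransMap* m : nodes)
      for (const auto& e : *m) {
        cuts.insert(e.first.lo);
        cuts.insert(e.first.hi + 1);
      }

    // 2. Each pair of consecutive cuts bounds an interval on which no
    //    node's answer can change, so it is looked up once per node.
    std::vector<std::pair<UCSRange, std::set<int>>> out;
    for (auto it = cuts.begin(); it != cuts.end(); ) {
      uint32_t lo = *it++;
      if (it == cuts.end()) break;
      UCSRange iv{ lo, *it - 1 };
      std::set<int> targets;
      for (const TransMap* m : nodes) {
        auto hit = m->find(iv);       // overlap counts as a match here
        if (hit != m->end()) targets.insert(hit->second);
      }
      if (!targets.empty()) out.push_back({ iv, targets });
    }
    return out;
  }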


shap