So we've been going back and forth on strings elsewhere, and Ben made a
(perfectly reasonable) statement that's been bugging me. Something like:

There may be *specialized applications* like lexing and parsing where users
should use specialized libraries.

The first thing to say here is that Ben is absolutely (and obviously) right.
Which disturbs me, because I'm really convinced that being able to provide a
migration path from C[++] to something safe is pretty important. The issues
involved are less about BitC *per se* (though they do touch on us) than they
are about GC'd and manual storage management. Since his comment hit me when I
was gearing up to self-host the BitC lexer, and it turns out that lexing is
a good case in point, let me go through that one.

The problem with lexing comes in two places:

   1. Naive implementations are copy-intensive. Lots and lots of data
   copies.
   2. Traditional implementations rely heavily on mutability. Which isn't
   bad *per se*, but it raises idiom issues we should think about.

Let's look first at copying. I will ignore copies performed at the OS kernel
level, since these are presumably neutral to both solutions. Given a lexer
written in C, copies are made in several places:

   1. In the stdio library, for the purpose of amortizing system call and
   file system access overheads.
   2. Within the lexer, where we accumulate each token into a distinct
   string. Strictly speaking we only need to do this for identifiers, but
   these are the majority of tokens in any case. I have seen this done either
   by using something like std::string (or a C equivalent) or by using a
   string-intern mechanism. When implemented in C, the second trades loss of
   storage reclamation for a significant reduction in complexity and
   improvement in speed - the storage allocations performed by std::string
   are staggeringly expensive. The interning approach does not appear to be
   widely used in C.
   3. If a std::string style of identifier accumulation is performed, then
   the strings tend to get copied every time their containing ASTs get
   copied. That is: once you are on the std::string model, you tend to have
   gotten off the GC model.

While it is rare to *use* it, the std::string approach is also expensive
because it must be prepared to grow the string on the fly. The more common
practice is to preallocate each std::string instance at 8 bytes (which is
more than you usually need) and then grow on demand. The problem with this
practice is that you have to test against the need for growth with every
character append.

I have seen implementations that build (interned) string tables. These work
very well, but require disciplined programming when implemented in C.


When I think of lexers that are written in GC'd languages prior to .Net, I
can't think of any that *don't* intern strings. What happens in
high-performance lexers in these languages is that the stdio package is
avoided, and I/O is performed directly into a vector managed by the lexer.
When the current (pending) token falls off the end of the vector, it is
copied to the bottom and a new read is performed. If you manage to get a
single token that overflows the I/O buffer, you grow it (this code is never
exercised in the field). The "intern" call is handed the I/O buffer and a
pair of start/stop indices. Less aggressive designs may assemble the pending
token into a reversed list and rely on nursery GC to collect the list cells.

Note that both of these designs assume an absence of concurrent update on
the vector. If the vector content is moving while the intern logic is
scanning it (or equivalently, the linked-list content), you can get into
real trouble.

Note further that there is absolutely nothing precluding the interned-string
approach from being used in C, other than the traditions of the respective
memory management idioms. The interned approach is strictly more efficient when
correctly implemented. Note that interning precludes collection in the GC'd
approach unless you do something excessively clever with finalizers (which
isn't worth the bother).

The point of all this, I suppose, is to illustrate that Ben really
*is* correct, and that the preferred idioms in the two cases really
*are* quite different. It bears thinking about, mainly because of the
conversion issues.


shap
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev