29-Jan-2013 01:04, Timon Gehr wrote:
On 01/28/2013 12:59 PM, Dmitry Olshansky wrote:
28-Jan-2013 15:48, Johannes Pfau wrote:
...

But to be fair, that doesn't fit ranges very well. If you don't want to
do any allocation but still keep identifiers etc. in memory, this
basically means you have to keep the whole source in memory, and that is
conceptually an array and not a range.


Not the whole source; you only need to construct a table of all
identifiers. The source is awfully redundant because of repeated
identifiers, spaces, comments and whatnot. The set of unique
identifiers is rather small.


Source code is usually small. (Even std.datetime has 'only' 1.6 MB.) My
own lexer-parser combination slices directly into the original source
code, for every token, in order to remember the exact source location,
and the last time I measured, it ran faster than DMD's. I keep the
source around for error reporting anyway:

tt.d:27:5: error: no member 'n' for type 'A'
     a.n=2;
     ^~~

Since the tokens point directly into the source code, it is not
necessary to construct any other data structures in order to allow fast
retrieval of the appropriate source code line.
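Roughly, that idea might be sketched in C like this (the names here are made up for illustration; the real lexer is written in D). Each token is just a pointer-plus-length slice into the source buffer, so the offending line for a diagnostic is recovered by scanning outward from the token itself:

```c
#include <stdio.h>
#include <stddef.h>

/* A token is simply a slice of the original source buffer. */
typedef struct {
    const char *start;   /* points directly into the source text */
    size_t      len;
} Token;

/* Find the source line containing tok by scanning outward from the
   token; no line table is needed, because the token already points
   into the source. Returns the caret column (0-based). */
static size_t source_line(const char *src, Token tok,
                          const char **line_start, const char **line_end)
{
    const char *s = tok.start, *e = tok.start + tok.len;
    while (s > src && s[-1] != '\n') s--;
    while (*e && *e != '\n') e++;
    *line_start = s;
    *line_end   = e;
    return (size_t)(tok.start - s);
}

/* Render a caret diagnostic like the one quoted above. */
static void report(const char *src, Token tok, const char *msg)
{
    const char *ls, *le;
    size_t col = source_line(src, tok, &ls, &le);
    printf("error: %s\n%.*s\n%*s^", msg, (int)(le - ls), ls, (int)col, "");
    for (size_t i = 1; i < tok.len; i++) putchar('~');
    putchar('\n');
}
```

Calling report() on a token covering "a.n" in "    a.n=2;" would print the line followed by "    ^~~", matching the diagnostic shown earlier.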

But it's clear that a general-purpose library might not want to impose
this restriction on storage upon its clients. I think it is somewhat
helpful for speed, though. The other thing I do is buffer tokens in a
contiguous ring buffer that grows if a lot of lookahead is requested.
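A growable token ring buffer of that sort might look like the following C sketch (names and the power-of-two growth policy are my own assumptions, not the actual implementation):

```c
#include <stdlib.h>

typedef struct { int kind; } Token;

/* Growable ring buffer of buffered lookahead tokens.
   cap is kept a power of two so wrap-around is a cheap mask. */
typedef struct {
    Token *buf;
    size_t cap, head, count;
} TokenRing;

static void ring_init(TokenRing *r) {
    r->cap = 8; r->head = r->count = 0;
    r->buf = malloc(r->cap * sizeof(Token));
}

static void ring_push(TokenRing *r, Token t) {
    if (r->count == r->cap) {
        /* Grow: copy out in logical order so indexing stays simple. */
        Token *nb = malloc(r->cap * 2 * sizeof(Token));
        for (size_t i = 0; i < r->count; i++)
            nb[i] = r->buf[(r->head + i) & (r->cap - 1)];
        free(r->buf);
        r->buf = nb; r->head = 0; r->cap *= 2;
    }
    r->buf[(r->head + r->count++) & (r->cap - 1)] = t;
}

/* Peek k tokens ahead; k == 0 is the current token. */
static Token ring_peek(const TokenRing *r, size_t k) {
    return r->buf[(r->head + k) & (r->cap - 1)];
}

static Token ring_pop(TokenRing *r) {
    Token t = r->buf[r->head];
    r->head = (r->head + 1) & (r->cap - 1);
    r->count--;
    return t;
}
```

Ordinary lexing only ever holds a handful of tokens, so the buffer stays small and hot in cache; the copy-on-grow path is hit only when the parser asks for unusually deep lookahead.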

I think the best course of action is to just provide a hook to trigger
on every identifier encountered. That could be as discussed earlier a
delegate.

...

Maybe. I map identifiers to unique IDs later, in the identifier AST
node constructor, though.


In the allocation scheme I proposed, the ID could be a 32-bit offset
into the chunk of unique identifiers. The identifiers themselves would
then have to be stored with sentinels (e.g. zero bytes) at the end,
like C strings. The 'map' step then comes for free.

The only overhead is actually copying the bytes into this buffer, but
that happens only once per unique identifier, and in exchange you get
to lex anything, not only the 'whole module in one memory block' kind
of input. On the plus side it's also more cache-friendly.
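Roughly, in C the scheme might look like this (all names are hypothetical, and a real lexer would put a hash table in front of the chunk instead of the linear scan used here for brevity):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* All unique identifiers live back-to-back in one growable chunk,
   each terminated by a '\0' sentinel. An identifier's ID is its
   32-bit byte offset into the chunk, so mapping ID -> name is just
   a pointer addition. */
typedef struct {
    char    *chunk;
    uint32_t used, cap;
} IdentTable;

static void idt_init(IdentTable *t) {
    t->cap = 1024; t->used = 0;
    t->chunk = malloc(t->cap);
}

/* Intern name[0..len): return the existing ID if seen before,
   otherwise copy the bytes once and hand out a fresh offset. */
static uint32_t idt_intern(IdentTable *t, const char *name, uint32_t len) {
    uint32_t off = 0;
    while (off < t->used) {                  /* scan existing entries */
        uint32_t n = (uint32_t)strlen(t->chunk + off);
        if (n == len && memcmp(t->chunk + off, name, len) == 0)
            return off;                      /* already interned */
        off += n + 1;                        /* skip the sentinel too */
    }
    while (t->used + len + 1 > t->cap) {     /* grow the chunk */
        t->cap *= 2;
        t->chunk = realloc(t->chunk, t->cap);
    }
    memcpy(t->chunk + t->used, name, len);   /* copied once, ever */
    t->chunk[t->used + len] = '\0';
    off = t->used;
    t->used += len + 1;
    return off;
}

static const char *idt_name(const IdentTable *t, uint32_t id) {
    return t->chunk + id;                    /* the free 'map' step */
}
```

Note that the token stream only carries the 32-bit offsets, so it stays compact regardless of how the underlying source buffers were delivered.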

Overall I think it's a good trade-off, but I never had the time to exercise it.

--
Dmitry Olshansky
