Keean:

You make a couple of assertions in your response that I can't work my way
through. Could very well be that I just haven't thought this through hard
enough yet. Apologies, but I think this will make more sense if I take your
comments out of order.

BUFFERING

I *did* note the smileys, but I want to make sure we're laughing at the
same joke. How do you buffer a terabyte stream arriving over a network port
(thus not backwards seekable) on a machine with 4GB to 8GB of memory?

DO/DONE

You wrote:

> I dont see the do / done problem. One looks for "d" "o" (optional space)
> some other token, the other "d" "o" "n" "e", there is no ambiguity.. The
> "do" parser should clearly reject "don...".


We're doing our token matching with simple comparison on the leading
string. In order to realize that the input 'd', 'o', 'n', 'e' should not
satisfy "do", we need to know one of two things:

1. The list of all tokens, by which we might come to know that there is a
potential longer match, or
2. The list of potential token separator/terminator characters, by which we
would know that the character after 'o' must be consumed under the maximal
munch rule.

The reason it's a problem is that I don't see either of those requirements
satisfied. There's no tokenizer state, so we don't know the token
terminator characters. Meanwhile, the "do" and "done" matching rules are
worlds apart with no obvious way to combine them into a single regexp.
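
To make sure we are looking at the same failure, here is a minimal sketch
(Haskell, and just the shape of the thing, not our actual code) of what I
mean by "simple comparison on the leading string":

    -- Sketch only: a parser is just  input -> Maybe (result, rest).
    type Parser a = String -> Maybe (a, String)

    -- Match a literal by comparing the leading characters and nothing else.
    lit :: String -> Parser String
    lit s input
      | s == take (length s) input = Just (s, drop (length s) input)
      | otherwise                  = Nothing

    -- lit "do" "done ..."  ==>  Just ("do", "ne ...")
    -- i.e. "do" happily consumes the first two characters of "done",
    -- leaving "ne" for some later parser to trip over.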

Oh. I see how the GLSL parser did it. The keyword combinator checks that
the matched string is not followed by a keyword-continuation character. Is
that the short answer? I also see that this disambiguates keywords from
other identifiers *provided* we match the keywords first.
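
If I have that right, the fix amounts to something like the following
(again a sketch building on the one above; isKeywordChar stands in for
whatever the grammar treats as an identifier-continuation character):

    import Data.Char (isAlphaNum)

    -- Assumed definition of a keyword-continuation character.
    isKeywordChar :: Char -> Bool
    isKeywordChar c = isAlphaNum c || c == '_'

    -- Match the literal, then insist the next character cannot extend it.
    keyword :: String -> Parser String
    keyword s input =
      case lit s input of
        Just (tok, rest)
          | null rest || not (isKeywordChar (head rest)) -> Just (tok, rest)
        _ -> Nothing

    -- keyword "do" "done"   ==>  Nothing    ('n' could continue the token)
    -- keyword "do" "do x"   ==>  Just ("do", " x")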

VIRTUAL TOKENIZATION

You wrote:

> I use virtual tokenization, which is a parser-combinator. IE I take a
> parser description in combinators, and convert it into a tokenizer. It does
> both in one pass by effectively using lazy tokenization.


I'm probably missing something perfectly obvious, but I don't see how this
is done. I understand that if you can gather the tokens together into a
list of regexps you can generate a tokenizer. I also understand that the
tokenizer can be lazily turned into optimized code. What I don't see is how
to gather the regexps.

At the time the various constructs like (keyword "if") appear, they appear
in completely disconnected [sub]parsers. Later those [sub]parsers are
joined by connecting up the resulting functions. At that point you can no
longer walk the parsers to locate the keyword matchers they contain.

I can certainly see how to do it if the type of "parser" is not (stream,
state) -> result. That is: if the return value of a combinator is not of
function type.
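
For concreteness, the only shape I can see that makes the extraction
possible is something like the following sketch, where the combinators
build a description rather than a function, and the description is walked
for keywords before being interpreted. This is my guess, not a claim about
your implementation:

    -- Combinators build a *description* of the parser, not a function.
    data P
      = Keyword String     -- literal keyword token
      | Seq P P            -- sequencing
      | Alt P P            -- alternation
      -- ... other combinators elided

    -- The description can be walked to gather every keyword for the
    -- tokenizer before it is ever turned into a runnable parser.
    keywords :: P -> [String]
    keywords (Keyword s) = [s]
    keywords (Seq a b)   = keywords a ++ keywords b
    keywords (Alt a b)   = keywords a ++ keywords b

    example :: P
    example = Alt (Seq (Keyword "do") (Keyword "done")) (Keyword "if")
    -- keywords example  ==>  ["do","done","if"]

Once the description has been interpreted into a function of type (stream,
state) -> result, that structure is gone, which is exactly where I get
stuck with your description.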

So: how are you able to extract the various keyword and similar regexps for
merging?


shap