On Wed, Jun 3, 2015 at 12:35 PM, Keean Schupke <ke...@fry-it.com> wrote:

> On 3 Jun 2015 20:31, "Jonathan S. Shapiro" <s...@eros-os.org> wrote:
> > BUFFERING
> >
> > I *did* note the smileys, but I want to make sure we're laughing at the
> same joke. How do you buffer a terabyte stream arriving over a network port
> (thus not backwards seekable) on a machine with 4GB to 8GB of memory?
>
> Use a seekable stream, and buffer as many pages in RAM as you have
> available. Kernel page buffers in the unified page cache work this way,
> so you don't need to do anything special; just use a buffered stream and
> fseek.
>
Keean: read the constraints again. The machine in question falls short, by
many orders of magnitude, of the memory needed to buffer the input. It is
impossible to build a generally backwards-seekable stream under these
conditions.

Please note that this is *not* a contrived example. There are real inputs
that are too large to buffer in full and must be processed at line rate.
*Bounded* backtracking may be fine for such applications. Unbounded
backtracking is not. It is a hard requirement that forward progress in
consuming the input be guaranteed.
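Roughly, the shape I have in mind is something like the following (an
untested sketch, all names mine): keep a fixed-size replay window behind the
read cursor, allow marks only within that window, and discard anything
older, so consumption of the input is forced to make forward progress.

(* Sketch: a forward-only stream with a bounded replay window.  Marks are
   only valid within the last [window] bytes; older input is dropped, so
   memory use is constant regardless of input size. *)

type bounded_stream = {
  src          : in_channel;   (* non-seekable source, e.g. a socket pipe *)
  window       : int;          (* maximum backtracking distance in bytes *)
  buf          : Buffer.t;     (* bytes still eligible for replay *)
  mutable base : int;          (* absolute offset of the start of [buf] *)
  mutable pos  : int;          (* absolute offset of the read cursor *)
}

let make src window =
  { src; window; buf = Buffer.create window; base = 0; pos = 0 }

(* Read one byte, refilling the replay buffer and trimming it to the
   window size so the buffer never grows without bound. *)
let next s =
  if s.pos < s.base + Buffer.length s.buf then begin
    let c = Buffer.nth s.buf (s.pos - s.base) in
    s.pos <- s.pos + 1;
    Some c
  end else
    match input_char s.src with
    | exception End_of_file -> None
    | c ->
        Buffer.add_char s.buf c;
        (* Anything older than the window can no longer be rewound to. *)
        if Buffer.length s.buf > s.window then begin
          let drop = Buffer.length s.buf - s.window in
          let kept = Buffer.sub s.buf drop s.window in
          Buffer.clear s.buf;
          Buffer.add_string s.buf kept;
          s.base <- s.base + drop
        end;
        s.pos <- s.pos + 1;
        Some c

(* A mark is just an absolute offset; rewinding past the window fails. *)
let mark s = s.pos
let rewind s m =
  if m < s.base then invalid_arg "rewind: mark fell outside the window"
  else s.pos <- m

With a window like this, bounded backtracking costs constant memory, and a
parser that tries to rewind past the window fails loudly instead of
silently trying to buffer a terabyte.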

> So: how are you able to extract the various keyword and similar regexps
> for merging?
>
> Well the lazy tokenization I am currently using relies on white-space. So
> "do " is clearly different from "done ".
>
> This is not ideal as you can't do things like "do{..." Although in this
> case with backtracking there is no problem (because done would fail).
>
> You only get problems if both "done" and "do ne" are valid syntax.
>
I'm sorry - that doesn't tell me how you are going about extracting the
keywords.

You said in your earlier note that you were concerned about the need for
context-sensitive tokenization. I think there are two cases of this:

1. Sub-languages, as when JavaScript code appears within an HTML script
tag. This one isn't so bad. You can model the tokenizer as a stream and
introduce sub-tokenizers that advance the shared input stream, in a fashion
conceptually similar to what fparsec's CharStream construct is doing.

2. Truly context-sensitive lexing, in which certain lexemes are [or are
not] admitted in certain parse contexts. In effect, this amounts to having
different start states for the tokenizer, which is in turn equivalent to
having multiple sub-tokenizers as I just sketched in [1]. It's a fairly
vile language design approach, but you don't always control the language
that you are dealing with, so sometimes there is little choice. A rough
illustration of both cases follows below.
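
Concretely, something along these lines (again an untested sketch; the
HTML/script split and all the names are just for illustration): each mode
is a separate lexing function, but every mode advances the same cursor, so
switching sub-tokenizers never duplicates or re-reads the input.

(* Sketch: context-dependent lexing as multiple start states over one
   shared cursor.  The parser picks the mode; all modes advance the same
   position. *)

type mode = Html | Script                     (* hypothetical sub-languages *)

type cursor = { input : string; mutable off : int }

type token = Tag of string | Word of string | Eof

let is_space c = c = ' ' || c = '\n' || c = '\t'

(* Consume characters while [p] holds, starting at the shared cursor. *)
let take_while cur p =
  let start = cur.off in
  while cur.off < String.length cur.input && p cur.input.[cur.off] do
    cur.off <- cur.off + 1
  done;
  String.sub cur.input start (cur.off - start)

(* One lexing function per start state; both advance the same cursor. *)
let lex_html cur =
  ignore (take_while cur is_space);
  if cur.off >= String.length cur.input then Eof
  else if cur.input.[cur.off] = '<' then begin
    cur.off <- cur.off + 1;                                  (* skip '<' *)
    let name = take_while cur (fun c -> c <> '>') in
    if cur.off < String.length cur.input then cur.off <- cur.off + 1;
    Tag name
  end
  else Word (take_while cur (fun c -> c <> '<' && not (is_space c)))

let lex_script cur =
  ignore (take_while cur is_space);
  if cur.off >= String.length cur.input then Eof
  else Word (take_while cur (fun c -> not (is_space c)))

(* The parser chooses the tokenizer by context. *)
let next_token mode cur =
  match mode with Html -> lex_html cur | Script -> lex_script cur

The parser decides which mode applies at each point, e.g. next_token Html
cur to read the enclosing tag and then next_token Script cur for the body;
that decision is exactly the context-sensitivity in [2].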

Note that context-sensitive lexing makes it difficult to cache the tokens
in the token stream when backtracking occurs. It's doable, but certainly
not pretty...
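
The least-ugly way I can see to do it (sketch only, building on the
fragment above) is to key the memoized tokens on both the input offset and
the lexer start state, so a token lexed in one mode is never replayed in
another:

(* Sketch: memoize tokens by (offset, mode), reusing [mode], [cursor],
   [token] and [next_token] from the previous fragment. *)

let token_cache : (int * mode, token * int) Hashtbl.t = Hashtbl.create 256

(* Return the token at the cursor under [m], from the cache if this
   (offset, mode) pair has been lexed before, leaving the cursor past it. *)
let cached_token m cur =
  match Hashtbl.find_opt token_cache (cur.off, m) with
  | Some (tok, next_off) ->
      cur.off <- next_off;                  (* replay without re-lexing *)
      tok
  | None ->
      let start = cur.off in
      let tok = next_token m cur in
      Hashtbl.add token_cache (start, m) (tok, cur.off);
      tok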


shap
