On Wed, Jun 3, 2015 at 12:35 PM, Keean Schupke <ke...@fry-it.com> wrote:
> On 3 Jun 2015 20:31, "Jonathan S. Shapiro" <s...@eros-os.org> wrote:
> > BUFFERING
> >
> > I *did* note the smileys, but I want to make sure we're laughing at
> > the same joke. How do you buffer a terabyte stream arriving over a
> > network port (thus not backwards seekable) on a machine with 4GB to
> > 8GB of memory?
>
> Use a seekable stream, and buffer as many pages in tmp RAM as you have
> available RAM. Kernel page buffers in the unified page cache work this
> way, so you don't need to do anything special; just use a buffered
> stream and fseek.

Keean: read the constraints again. The machine in question falls short of
the memory needed to buffer the input by many, many orders of magnitude.
It is impossible to build a generally backwards-seekable stream under
these conditions.

Please note that this is *not* a contrived example. There are real inputs
that are too large to buffer in full and must be processed at line rates.
*Bounded* backtracking may be fine for such applications; unbounded
backtracking is not. It is a hard requirement that forward progress in
consuming the input be guaranteed.

> > So: how are you able to extract the various keyword and similar
> > regexps for merging?
>
> Well, the lazy tokenization I am currently using relies on white-space,
> so "do " is clearly different from "done ".
>
> This is not ideal, as you can't do things like "do{...". Although in
> this case with backtracking there is no problem (because "done" would
> fail).
>
> You only get problems if both "done" and "do ne" are valid syntax.

I'm sorry - that doesn't tell me how you are going about extracting the
keywords.

You said in your earlier note that you were concerned about the need for
context-sensitive tokenization. I think there are two cases of this:

1. Sub-languages, as when JavaScript code appears within an HTML script
   tag. This one isn't so bad. You can model the tokenizer as a stream
   and introduce sub-tokenizers that advance the stream input in a
   fashion conceptually similar to what fparsec's CharStream construct
   is doing.

2. Truly context-sensitive lexing, in which certain lexemes are [or are
   not] admitted in certain parse contexts. In effect, this amounts to
   having different start states for the tokenizer, which is in turn
   equivalent to having multiple sub-tokenizers as I just sketched in
   [1]. It's a fairly vile language design approach, but you don't
   always control the language you are dealing with, so sometimes there
   is little choice.

Note that context-sensitive lexing makes it difficult to cache the tokens
in the token stream when backtracking occurs. It's doable, but certainly
not pretty...

shap
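
P.S. For concreteness, here is a rough sketch of the sort of *bounded*
backtracking I have in mind: a forward-only input that retains only a
fixed-size window of recent bytes, where committing releases everything
behind the cursor so forward progress is guaranteed. It's Haskell, and
all of the names (Window, fill, rewind, commit) are made up for
illustration; this is not taken from any real implementation.

import qualified Data.ByteString as BS
import System.IO (Handle, hIsEOF)

-- A forward-only input with a bounded backtrack window.  Only the bytes
-- read since the last commit are retained.
data Window = Window
  { winLimit  :: Int            -- largest number of bytes we will retain
  , winBuffer :: BS.ByteString  -- bytes read since the last commit
  , winCursor :: Int            -- current read position within winBuffer
  , winSource :: Handle         -- the underlying non-seekable stream
  }

-- Pull another chunk from the stream, refusing to grow past the bound.
-- A real parser would fail, or force a commit, rather than return Nothing.
fill :: Window -> IO (Maybe Window)
fill w
  | BS.length (winBuffer w) >= winLimit w = return Nothing
  | otherwise = do
      eof <- hIsEOF (winSource w)
      if eof
        then return Nothing
        else do
          chunk <- BS.hGetSome (winSource w) 4096
          return (Just w { winBuffer = BS.append (winBuffer w) chunk })

-- Backtracking is legal only to a point that is still inside the window.
rewind :: Int -> Window -> Maybe Window
rewind pos w
  | pos >= 0 && pos <= winCursor w = Just w { winCursor = pos }
  | otherwise                      = Nothing

-- Committing drops everything behind the cursor, so memory stays bounded
-- and the parser can never walk backwards past a commit point.
commit :: Window -> Window
commit w = w { winBuffer = BS.drop (winCursor w) (winBuffer w)
             , winCursor = 0 }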
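
And, equally rough, here is what I mean by "different start states" and
why the token cache gets ugly: the lexer takes an explicit state (one
"sub-tokenizer" per state), and cached tokens have to be keyed by both
offset and state, because the same offset can lex differently under
different states. Again, every name here (LexState, lexOne, lexCached)
is hypothetical.

import qualified Data.Map.Strict as M

-- Lexer start states: the same characters tokenize differently in each.
data LexState = Normal | InString
  deriving (Eq, Ord, Show)

data Token = TWord String | TPunct Char | TStringBody String
  deriving (Eq, Show)

-- One sub-tokenizer per start state.  Each consumes a prefix of the
-- input and reports the state that applies afterwards.
lexOne :: LexState -> String -> Maybe (Token, String, LexState)
lexOne _ [] = Nothing
lexOne Normal (c:rest)
  | c == ' ' || c == '\n' = lexOne Normal rest                 -- skip blanks
  | c == '"'              = Just (TPunct '"', rest, InString)  -- enter string context
  | c `elem` "{}();"      = Just (TPunct c, rest, Normal)
  | otherwise             =
      let (w, rest') = span (`notElem` " \n\"{}();") (c:rest)
      in Just (TWord w, rest', Normal)
lexOne InString s =
  let (body, rest) = break (== '"') s
  in Just (TStringBody body, drop 1 rest, Normal)

-- The cache is keyed by BOTH offset and lexer state; when backtracking
-- revisits an offset, a cached token is reusable only if the state
-- matches too.
type Cache = M.Map (Int, LexState) (Token, Int, LexState)

lexCached :: String -> Cache -> LexState -> Int
          -> (Maybe (Token, Int, LexState), Cache)
lexCached input cache st off =
  case M.lookup (off, st) cache of
    Just hit -> (Just hit, cache)              -- reused across backtracks
    Nothing  ->
      case lexOne st (drop off input) of
        Nothing -> (Nothing, cache)
        Just (tok, rest, st') ->
          let consumed = (length input - off) - length rest
              hit      = (tok, off + consumed, st')
          in (Just hit, M.insert (off, st) hit cache)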