On 03/16/2012 09:03 AM, Philippe Veber wrote:
Dear camlers,

Say that you'd like to search a regexp on a file with lines so long that
you'd rather not load them entirely at once. If you can bound the size
of a match by k << length of a line, then you know that you can only
keep a small portion of the line in memory to search the regexp.
Typically you'd like to access substrings of size k from left to right.
I guess such a thing should involve buffered inputs and avoid copying
strings as much as possible. My question is as follows: has anybody
written a library to access these substrings gracefully and with decent
performance?
Cheers,
   Philippe.

This is tricky to do, as incremental matching implies DFA-based matching, but returning matching substrings implies backtrack-based matching. The re2 library returns matching substrings by matching forward to find the end of patterns, and then matching backwards on the reversed regex from that point to find their beginning. I don't know if even it supports this on incremental input.

Subject to the assumption that starting to match implies either finishing or aborting a match (i.e. once you've started to match your regex, no other match will start before either this match attempt finishes successful or not), this is not unreasonable to do. Without this assumption, it requires tracking many match start locations and somehow knowing which of them match or fail to match. I'm not sure this has been done before.

E.

--
Caml-list mailing list.  Subscription management and archives:
https://sympa-roc.inria.fr/wws/info/caml-list
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs

Reply via email to