Re: [Caml-list] Efficient scanning of large strings from files

Edgar Friendly Fri, 16 Mar 2012 10:02:32 -0700

So, given a large file and a line number, you want to:
1) extract that line from the file,
2) produce an enum of all k-length slices of that line, and
3) match each slice against your regexp set to produce a list/enum of
substrings that match the regexps,
all without reading the whole line into memory at once?

I'm with Dimino on the right solution - just use a matcher that works
incrementally, feed it one byte at a time, and have it return a list of
match offsets.  Then work backwards from these endpoints to figure out
which substrings you want.
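
To make that concrete, here is a minimal sketch of such a matcher, assuming
the pattern is a single literal string (you said your regexps are trivial;
a real regexp set would need a proper automaton). It uses the classic
Knuth-Morris-Pratt failure table. The names make, feed and match_ends are
made up for this example and come from no library.

```ocaml
(* Incremental matcher for one literal pattern (KMP automaton).
   All names here are hypothetical, for illustration only. *)
type matcher = {
  pat : string;
  fail : int array;      (* KMP failure table *)
  mutable state : int;   (* number of pattern bytes currently matched *)
  mutable pos : int;     (* number of bytes fed so far *)
}

let make pat =
  let m = String.length pat in
  if m = 0 then invalid_arg "make: empty pattern";
  let fail = Array.make m 0 in
  let k = ref 0 in
  for i = 1 to m - 1 do
    while !k > 0 && pat.[i] <> pat.[!k] do k := fail.(!k - 1) done;
    if pat.[i] = pat.[!k] then incr k;
    fail.(i) <- !k
  done;
  { pat; fail; state = 0; pos = 0 }

(* Feed one byte; return [Some e] if a match ends at offset [e]. *)
let feed t c =
  let m = String.length t.pat in
  (* After a full match, or on a mismatch, fall back via the table. *)
  while t.state > 0 && (t.state = m || t.pat.[t.state] <> c) do
    t.state <- t.fail.(t.state - 1)
  done;
  if t.pat.[t.state] = c then t.state <- t.state + 1;
  let here = t.pos in
  t.pos <- t.pos + 1;
  if t.state = m then Some here else None

(* Convenience wrapper: end offsets of all (possibly overlapping) matches. *)
let match_ends pat s =
  let t = make pat in
  let acc = ref [] in
  String.iter
    (fun c -> match feed t c with Some e -> acc := e :: !acc | None -> ())
    s;
  List.rev !acc
```

For a literal pattern, "working backwards" is just arithmetic: a match
ending at offset e starts at e - String.length pat + 1. The matcher only
keeps its state and a byte counter, so the line never has to be held in
memory.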

There shouldn't be a reason to use substrings (0,k-1) and (1,k) - it should
suffice to use (0,k-1) and (k,2k-1) with an incremental matching routine.
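
As a rough sketch of the non-overlapping version, assuming plain stdlib
channels (which are already buffered internally, so input_char is cheap):
skip to the wanted line, then hand out chunks of at most k bytes from a
single reused buffer. The names skip_lines and iter_line_chunks are
invented for this example; the callback stands in for whatever incremental
matcher you feed.

```ocaml
(* Skip the first [n] lines of [ic] without storing them. *)
let skip_lines ic n =
  let remaining = ref n in
  (try
     while !remaining > 0 do
       if input_char ic = '\n' then decr remaining
     done
   with End_of_file -> ())

(* Call [f buf len] on successive non-overlapping chunks of at most [k]
   bytes of the current line, stopping at '\n' or end of file.
   The same buffer is reused for every chunk, so [f] must not keep it. *)
let iter_line_chunks ic k f =
  if k <= 0 then invalid_arg "iter_line_chunks: k must be positive";
  let buf = Bytes.create k in
  let len = ref 0 in
  let flush () = if !len > 0 then (f buf !len; len := 0) in
  (try
     let continue = ref true in
     while !continue do
       match input_char ic with
       | '\n' -> continue := false
       | c ->
         Bytes.set buf !len c;
         incr len;
         if !len = k then flush ()
     done
   with End_of_file -> ());
  (* Hand over the final, possibly shorter, chunk. *)
  flush ()
```

Because the matcher carries its own state from one call to the next, a
match that straddles a chunk boundary is still found, which is why the
overlapping slices (1,k), (2,k+1), ... are not needed. Memory use is one
buffer of k bytes plus the matcher state.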

E.

On Fri, Mar 16, 2012 at 10:48 AM, Philippe Veber
<philippe.ve...@gmail.com> wrote:

> Thank you Edgar for your answer (and also Christophe). It seems my
> question was a bit misleading: actually I target a subset of regexps whose
> matching is really trivial, so this is no worry for me. I was more
> interested in how to access a large line in a file by chunks of fixed
> length k. For instance, how to build a [Substring.t Enum.t] from some line
> in a file, without building the whole line in memory. This enum would yield
> the substrings (0,k-1), (1,k), (2,k+1), etc ... without doing too many
> string copy/concat operations. I think I can do it myself but I'm not too
> confident regarding good practices on buffered reads of files. Maybe there
> are some good examples in Batteries?
>
> Thanks again,
>   ph.
>
>
>
