On 03/19/2012 05:08 AM, Philippe Veber wrote:
Thanks Edgar and Jérémie, this indeed seems to be the right track. I
just hope that a repeated use of input_char is not 10-100X slower than
input_line :o).
ph.

Quite true: rather than giving the matcher a single byte at a time, it is more efficient to give it blocks of data, as long as it can keep its state from one block to the next. Internally, though, it will normally still match one byte at a time. With DNA, because of the reduced character set, I guess it'd be possible to get each symbol down to 2 bits (if you're really just using ACGT), in which case the matcher could process 4 base pairs at a time, but there are a lot of corner cases in doing things that way. A lot depends on how much time and effort you're willing to spend engineering something.

E.
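
To make the 2-bit idea concrete, here is a minimal sketch of the packing step only, assuming the input really contains nothing but A, C, G and T. The names code_of_base, base_of_code, pack and get are made up for illustration and are not part of Batteries or any other library; a matcher working on 4 bases per byte would have to be built on top of this.

```ocaml
(* Hypothetical helpers: map each base to a 2-bit code and back. *)
let code_of_base = function
  | 'A' -> 0 | 'C' -> 1 | 'G' -> 2 | 'T' -> 3
  | c -> invalid_arg (Printf.sprintf "code_of_base: %C" c)

let base_of_code = function
  | 0 -> 'A' | 1 -> 'C' | 2 -> 'G' | 3 -> 'T'
  | _ -> invalid_arg "base_of_code"

(* Pack a sequence into bytes, 4 bases per byte, first base in the
   high bits. The sequence length has to be kept separately, because
   the last byte may be only partly filled. *)
let pack (s : string) : Bytes.t =
  let n = String.length s in
  let b = Bytes.make ((n + 3) / 4) '\000' in
  String.iteri
    (fun i c ->
      let shift = 6 - 2 * (i mod 4) in
      let old = Char.code (Bytes.get b (i / 4)) in
      Bytes.set b (i / 4) (Char.chr (old lor ((code_of_base c) lsl shift))))
    s;
  b

(* Read base number [i] back from the packed form. *)
let get (b : Bytes.t) (i : int) : char =
  let shift = 6 - 2 * (i mod 4) in
  base_of_code (((Char.code (Bytes.get b (i / 4))) lsr shift) land 3)
```

One of the corner cases mentioned above shows up immediately: anything outside ACGT (N, lowercase, line breaks) has no 2-bit code, so this sketch simply raises Invalid_argument on it.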

2012/3/16 Edgar Friendly <thelema...@gmail.com
<mailto:thelema...@gmail.com>>

    So given a large file and a line number, you want to:
    1) extract that line from the file
    2) produce an enum of all k-length slices of that line?
    3) match each slice against your regexp set to produce a list/enum
    of substrings that match the regexps?
    Without reading the whole line into memory at once.

    I'm with Dimino on the right solution - just use a matcher that
    works incrementally, feed it one byte at a time, and have it return
    a list of match offsets.  Then work backwards from these endpoints
    to figure out which substrings you want.

    There shouldn't be a reason to use substrings (0,k-1) and (1,k) - it
    should suffice to use (0,k-1) and (k,2k-1) with an incremental
    matching routine.

    E.
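
A rough sketch of such an incremental matcher, under a strong simplifying assumption: a single literal pattern stands in for the regexp set, and the matching is done with Knuth-Morris-Pratt, which is a substitution made here for illustration and not something proposed in the thread. The names matcher, make and feed are hypothetical. The point is only that the state survives from one block to the next, so the input can be fed as non-overlapping blocks (0,k-1), (k,2k-1), and so on.

```ocaml
(* Streaming matcher for one literal pattern (Knuth-Morris-Pratt).
   [state] is the number of pattern characters currently matched;
   [pos] is the absolute offset of the next input byte. *)
type matcher = {
  pat : string;
  fail : int array;          (* KMP failure table *)
  mutable state : int;
  mutable pos : int;
}

let make pat =
  let m = String.length pat in
  if m = 0 then invalid_arg "make: empty pattern";
  let fail = Array.make m 0 in
  let k = ref 0 in
  for i = 1 to m - 1 do
    while !k > 0 && pat.[i] <> pat.[!k] do k := fail.(!k - 1) done;
    if pat.[i] = pat.[!k] then incr k;
    fail.(i) <- !k
  done;
  { pat; fail; state = 0; pos = 0 }

(* Feed one block. Returns, in increasing order, the absolute offset
   of the last byte of every match that ends inside this block. *)
let feed t block =
  let m = String.length t.pat in
  let hits = ref [] in
  String.iter
    (fun c ->
      while t.state > 0 && c <> t.pat.[t.state] do
        t.state <- t.fail.(t.state - 1)
      done;
      if c = t.pat.[t.state] then t.state <- t.state + 1;
      if t.state = m then begin
        hits := t.pos :: !hits;
        (* Fall back so that overlapping matches are still found. *)
        t.state <- t.fail.(m - 1)
      end;
      t.pos <- t.pos + 1)
    block;
  List.rev !hits
```

Working backwards from an end offset e, the match starts at e - String.length pat + 1. A match that straddles two blocks is reported by the call that receives its last byte.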



    On Fri, Mar 16, 2012 at 10:48 AM, Philippe Veber
    <philippe.ve...@gmail.com <mailto:philippe.ve...@gmail.com>> wrote:

        Thank you Edgar for your answer (and also Christophe). It seems
        my question was a bit misleading: actually I target a subset of
        regexps whose matching is really trivial, so this is no worry
        for me. I was more interested in how to access a long line in a
        file in chunks of fixed length k. For instance, how to build a
        [Substring.t Enum.t] from some line in a file, without building
        the whole line in memory. This enum would yield the substrings
        (0,k-1), (1,k), (2,k+1), etc ... without doing too many string
        copy/concat operations. I think I can do it myself but I'm not
        too confident regarding good practices on buffered reads of
        files. Maybe there are some good examples in Batteries?

        Thanks again,
           ph.
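
One possible shape for this, using only the standard library rather than Batteries: a sliding window over the channel that holds at most 2k bytes and hands each window to a callback as a (buffer, offset) pair, so no substring is copied unless the callback copies it. The function name iter_windows and the callback convention are assumptions made for this sketch; wrapping it into a [Substring.t Enum.t] is left out. Channels from open_in are already buffered, so calling input_char repeatedly does not mean one system call per character.

```ocaml
(* Call [f buf off] for every k-length window of the current line of
   [ic]. The window is the k bytes of [buf] starting at [off]; it is
   only valid during the call. At most 2k bytes are held in memory. *)
let iter_windows k ic f =
  if k <= 0 then invalid_arg "iter_windows: k must be positive";
  let cap = 2 * k in
  let buf = Bytes.create cap in
  let len = ref 0 in               (* number of valid bytes in buf *)
  let rec loop () =
    match input_char ic with
    | exception End_of_file -> ()
    | '\n' -> ()                   (* end of the current line *)
    | c ->
      if !len = cap then begin
        (* Buffer full: keep only the last k-1 bytes. *)
        Bytes.blit buf (cap - (k - 1)) buf 0 (k - 1);
        len := k - 1
      end;
      Bytes.set buf !len c;
      incr len;
      if !len >= k then f buf (!len - k);
      loop ()
  in
  loop ()
```

For example, iter_windows 3 ic (fun buf off -> print_endline (Bytes.sub_string buf off 3)) prints every 3-mer of the line the channel is positioned on. Getting to a given line number first would mean reading and discarding characters until enough newlines have gone by.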






--
Caml-list mailing list.  Subscription management and archives:
https://sympa-roc.inria.fr/wws/info/caml-list
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs
