Re: [Caml-list] Efficient scanning of large strings from files

Edgar Friendly Mon, 19 Mar 2012 06:44:31 -0700

On 03/19/2012 05:08 AM, Philippe Veber wrote:

Thanks Edgar and Jérémie, this indeed seems to be the right track. I
just hope that a repeated use of input_char is not 10-100X slower than
input_line :o).
ph.

Quite true - instead of giving the matcher just a single byte at a time,
it is more efficient to give it blocks of data, as long as it can keep
its state from one block to the next. But its matching internally will
be on one byte at a time, normally. I guess with DNA, because of the
reduced character set, it'd be possible to get each symbol down to 2
bits (if you're really just using ACGT), in which case, the matcher
could run 4 basepairs at a time, but there's a lot of corner issues
doing things that way. A lot depends on how much time and effort you're
willing to spend engineering something.
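
The 2-bit idea can be sketched roughly as below. This is only an illustration under stated assumptions, not code from the thread: the names `code`, `pack` and `base_at` are hypothetical, the alphabet is assumed to be exactly A, C, G, T, and the sketch covers only the packing step, not a matcher that consumes four bases per step.

```ocaml
(* Map a base to its 2-bit code. Anything outside A, C, G, T (for
   example N) is rejected: one of the corner cases mentioned above. *)
let code = function
  | 'A' -> 0
  | 'C' -> 1
  | 'G' -> 2
  | 'T' -> 3
  | c -> invalid_arg (Printf.sprintf "code: unexpected symbol %C" c)

(* Pack four bases per byte, first base in the two highest bits.
   The last byte is zero-padded, and zero is also the code for 'A',
   so the caller has to remember the real number of bases. *)
let pack s =
  let n = String.length s in
  let out = Bytes.make ((n + 3) / 4) '\000' in
  String.iteri
    (fun i c ->
      let shift = 6 - 2 * (i land 3) in
      let byte = Char.code (Bytes.get out (i / 4)) in
      Bytes.set out (i / 4) (Char.chr (byte lor (code c lsl shift))))
    s;
  out

(* Read base number [i] back out of a packed buffer. *)
let base_at packed i =
  let shift = 6 - 2 * (i land 3) in
  "ACGT".[(Char.code (Bytes.get packed (i / 4)) lsr shift) land 3]
```

The padding ambiguity and the rejected symbols are two of the corner issues: a sequence whose length is not a multiple of 4 cannot be told apart from one ending in extra A's unless the length travels with the buffer.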

E.

2012/3/16 Edgar Friendly <thelema...@gmail.com
<mailto:thelema...@gmail.com>>

    So given a large file and a line number, you want to:
    1) extract that line from the file
    2) produce an enum of all k-length slices of that line?
    3) match each slice against your regexp set to produce a list/enum
    of substrings that match the regexps?
    Without reading the whole line into memory at once.

    I'm with Dimino on the right solution - just use a matcher that
    works incrementally, feed it one byte at a time, and have it return
    a list of match offsets.  Then work backwards from these endpoints
    to figure out which substrings you want.

    There shouldn't be a reason to use substrings (0,k-1) and (1,k) - it
    should suffice to use (0,k-1) and (k,2k-1) with an incremental
    matching routine.
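
A minimal sketch of such an incremental matcher, assuming a single fixed pattern as a stand-in for the "trivial" regexps discussed in the thread. The type `matcher` and the functions `make` and `feed` are hypothetical names, and the Knuth-Morris-Pratt automaton is just one possible choice. The state survives between calls, so non-overlapping chunks are enough.

```ocaml
type matcher = {
  pat : string;
  fail : int array;     (* KMP failure table for [pat] *)
  mutable state : int;  (* number of pattern bytes currently matched *)
  mutable pos : int;    (* total number of bytes consumed so far *)
}

let make pat =
  let m = String.length pat in
  if m = 0 then invalid_arg "make: empty pattern";
  let fail = Array.make m 0 in
  let k = ref 0 in
  for i = 1 to m - 1 do
    while !k > 0 && pat.[i] <> pat.[!k] do k := fail.(!k - 1) done;
    if pat.[i] = pat.[!k] then incr k;
    fail.(i) <- !k
  done;
  { pat; fail; state = 0; pos = 0 }

(* Feed one chunk. Returns the absolute, inclusive end offsets of all
   matches that complete inside this chunk, including matches that
   started in an earlier chunk. *)
let feed t chunk =
  let m = String.length t.pat in
  let ends = ref [] in
  String.iter
    (fun c ->
      while t.state > 0 && c <> t.pat.[t.state] do
        t.state <- t.fail.(t.state - 1)
      done;
      if c = t.pat.[t.state] then t.state <- t.state + 1;
      if t.state = m then begin
        ends := t.pos :: !ends;
        (* Fall back so that overlapping matches are still found. *)
        t.state <- t.fail.(m - 1)
      end;
      t.pos <- t.pos + 1)
    chunk;
  List.rev !ends
```

For a fixed pattern of length m, working backwards from an end offset e is just e - m + 1. With real regexps the start would have to be tracked or recovered separately.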

    E.



    On Fri, Mar 16, 2012 at 10:48 AM, Philippe Veber
    <philippe.ve...@gmail.com <mailto:philippe.ve...@gmail.com>> wrote:

        Thank you Edgar for your answer (and also Christophe). It seems
        my question was a bit misleading: actually I target a subset of
        regexps whose matching is really trivial, so this is no worry
        for me. I was more interested in how to access a large line in a
        file by chunks of fixed length k. For instance, how to build a
        [Substring.t Enum.t] from some line in a file, without building
        the whole line in memory. This enum would yield the substrings
        (0,k-1), (1,k), (2,k+1), etc ... without doing too many string
        copy/concat operations. I think I can do it myself but I'm not
        too confident regarding good practices on buffered reads of
        files. Maybe there are some good examples in Batteries?
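
One way to get the sliding windows without building the whole line, sketched with the standard library only rather than Batteries. `iter_windows` is a hypothetical name, and a callback replaces the `Enum.t` asked for above. Standard channels are already buffered, so `input_char` does not cost a system call per byte. Each byte is written twice into a buffer of size 2k so that the last k bytes are always contiguous, which avoids any copy or concatenation per window.

```ocaml
(* Call [f pos buf off] for every length-[k] window of the current line
   of [ic]. The window is the [k] bytes of [buf] starting at [off]; it
   is only valid during the call, because [buf] is reused. *)
let iter_windows ic k f =
  if k <= 0 then invalid_arg "iter_windows: k must be positive";
  (* Byte number [i] is stored at [i mod k] and at [i mod k + k]. *)
  let buf = Bytes.create (2 * k) in
  let rec loop i =
    match input_char ic with
    | exception End_of_file -> ()
    | '\n' -> ()
    | c ->
      let j = i mod k in
      Bytes.set buf j c;
      Bytes.set buf (j + k) c;
      (* Once [k] bytes have been read, the window ending at byte [i]
         occupies positions [j + 1 .. j + k] of [buf]. *)
      if i >= k - 1 then f (i - k + 1) buf (j + 1);
      loop (i + 1)
  in
  loop 0
```

A caller that needs to keep a window has to copy it, for example with `Bytes.sub_string buf off k`. A matcher that only reads the window during the callback needs no copy at all.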

        Thanks again,
           ph.



--
Caml-list mailing list.  Subscription management and archives:
https://sympa-roc.inria.fr/wws/info/caml-list
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs
