Re: [Caml-list] Efficient scanning of large strings from files

Philippe Veber Wed, 21 Mar 2012 00:49:07 -0700

2012/3/19 Edgar Friendly <thelema...@gmail.com>

> On 03/19/2012 05:08 AM, Philippe Veber wrote:
>
>> Thanks Edgar and Jérémie, this indeed seems to be the right track. I
>> just hope that a repeated use of input_char is not 10-100X slower than
>> input_line :o).
>> ph.
>
> Quite true - instead of giving the matcher just a single byte at a time,
> it is more efficient to give it blocks of data, as long as it can keep its
> state from one block to the next.  But its matching internally will be on
> one byte at a time, normally.


Thanks for the confirmation, I now see more clearly what to do.
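
For the record, here is a minimal sketch of what I understand by "feeding blocks while keeping state", assuming the pattern is a single fixed string (my regexps are not much more complicated than that). The names failure_table, feed and scan_channel and the block_size parameter are made up for this example, nothing here comes from Batteries or biocaml, and the automaton is plain Knuth-Morris-Pratt rather than a real regexp engine.

```ocaml
(* fail.(i) = length of the longest proper prefix of pat.[0..i] that is
   also a suffix of it (the usual Knuth-Morris-Pratt failure table). *)
let failure_table pat =
  let m = String.length pat in
  let fail = Array.make m 0 in
  let k = ref 0 in
  for i = 1 to m - 1 do
    while !k > 0 && pat.[i] <> pat.[!k] do k := fail.(!k - 1) done;
    if pat.[i] = pat.[!k] then incr k;
    fail.(i) <- !k
  done;
  fail

(* Scan the first [len] bytes of [buf].  [state] is the number of pattern
   characters matched so far, [base] the file offset of buf.[0].  The end
   offset of every match is pushed onto [hits]; the new state is returned
   so that the next block can resume where this one stopped. *)
let feed pat fail hits ~state ~base buf len =
  let m = String.length pat in
  let k = ref state in
  for i = 0 to len - 1 do
    let c = Bytes.get buf i in
    while !k > 0 && c <> pat.[!k] do k := fail.(!k - 1) done;
    if c = pat.[!k] then incr k;
    if !k = m then begin
      hits := (base + i) :: !hits;  (* offset of the last byte of the match *)
      k := fail.(m - 1)
    end
  done;
  !k

(* Read [ic] block by block and return the end offsets of all matches,
   in increasing order.  Only one block is in memory at any time. *)
let scan_channel ?(block_size = 65536) pat ic =
  if pat = "" then invalid_arg "scan_channel: empty pattern";
  let fail = failure_table pat in
  let buf = Bytes.create block_size in
  let hits = ref [] in
  let rec loop state base =
    let n = input ic buf 0 block_size in
    if n = 0 then List.rev !hits
    else loop (feed pat fail hits ~state ~base buf n) (base + n)
  in
  loop 0 0
```

The only thing carried from one block to the next is the integer state, so a match that straddles a block boundary is still found. A match ending at offset e starts at e - String.length pat + 1, which is the "work backwards from the endpoints" step.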


> I guess with DNA, because of the reduced character set, it'd be possible
> to get each symbol down to 2 bits (if you're really just using ACGT), in
> which case, the matcher could run 4 basepairs at a time, but there's a lot
> of corner issues doing things that way.  A lot depends on how much time and
> effort you're willing to spend engineering something.
>
Maybe not that far yet, but this is something we've mentioned for biocaml.
I guess I could take some inspiration from the bitset module in Batteries.
Anyway thanks everybody for your help!
ph.
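
To make the 2-bit idea concrete, here is a rough sketch of the packing step only, assuming the input contains nothing but A, C, G and T. The names code_of_base, base_of_code, pack and get are invented for this example; this is not the Batteries bitset API and not existing biocaml code, and it does not attempt the "match 4 basepairs at a time" part.

```ocaml
(* 2-bit code of a nucleotide; anything else is rejected. *)
let code_of_base = function
  | 'A' -> 0 | 'C' -> 1 | 'G' -> 2 | 'T' -> 3
  | c -> invalid_arg (Printf.sprintf "code_of_base: %C" c)

let base_of_code code = "ACGT".[code]

(* Pack a string over ACGT into bytes, four bases per byte.
   Base number i lives in byte i / 4, at bit offset 2 * (i mod 4). *)
let pack s =
  let n = String.length s in
  let packed = Bytes.make ((n + 3) / 4) '\000' in
  String.iteri (fun i c ->
    let shift = 2 * (i land 3) in
    let byte = Char.code (Bytes.get packed (i / 4)) in
    Bytes.set packed (i / 4)
      (Char.chr (byte lor (code_of_base c lsl shift))))
    s;
  packed

(* Read base number i back from a packed sequence. *)
let get packed i =
  let byte = Char.code (Bytes.get packed (i / 4)) in
  base_of_code ((byte lsr (2 * (i land 3))) land 3)
```

The packed form does not record the sequence length, so a real implementation would have to store it alongside (the last byte may be only partly used), and would have to decide what to do with N and the other ambiguity codes. I suppose these are some of the corner issues mentioned above.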


>
> E.
>
>  2012/3/16 Edgar Friendly <thelema...@gmail.com>
>>
>>
>>    So given a large file and a line number, you want to:
>>    1) extract that line from the file
>>    2) produce an enum of all k-length slices of that line?
>>    3) match each slice against your regexp set to produce a list/enum
>>    of substrings that match the regexps?
>>    Without reading the whole line into memory at once.
>>
>>    I'm with Dimino on the right solution - just use a matcher that
>>    works incrementally, feed it one byte at a time, and have it return
>>    a list of match offsets.  Then work backwards from these endpoints
>>    to figure out which substrings you want.
>>
>>    There shouldn't be a reason to use substrings (0,k-1) and (1,k) - it
>>    should suffice to use (0,k-1) and (k,2k-1) with an incremental
>>    matching routine.
>>
>>    E.
>>
>>
>>
>>    On Fri, Mar 16, 2012 at 10:48 AM, Philippe Veber
>>    <philippe.ve...@gmail.com> wrote:
>>
>>        Thank you Edgar for your answer (and also Christophe). It seems
>>        my question was a bit misleading: actually I target a subset of
>>        regexps whose matching is really trivial, so this is no worry
>>        for me. I was more interested in how to access a large line in
>>        a file by chunks of fixed length k. For instance how to build a
>>        [Substring.t Enum.t] from some line in a file, without building
>>        the whole line in memory. This enum would yield the substrings
>>        (0,k-1), (1,k), (2,k+1), etc ... without doing too many string
>>        copy/concat operations. I think I can do it myself but I'm not
>>        too confident regarding good practices on buffered reads of
>>        files. Maybe there are some good examples in Batteries?
>>
>>        Thanks again,
>>           ph.
>>
>>
>>
>>
>>
>
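
Coming back to the original question quoted above (enumerating the windows (0,k-1), (1,k), (2,k+1), ... of a long line without building the line in memory), one possible sketch with the standard library only is below. It is a callback-based iterator rather than a [Substring.t Enum.t], the name iter_windows and the block_size parameter are invented for the example, and it assumes the channel is already positioned at the start of the line of interest.

```ocaml
(* Call [f buf off] for every window of [k] bytes on the current line of
   [ic].  The window is buf.[off .. off + k - 1]; it lives in a buffer
   that is reused, so it is only valid for the duration of the call. *)
let iter_windows ?(block_size = 65536) k f ic =
  if k <= 0 then invalid_arg "iter_windows: k must be positive";
  let buf = Bytes.create (k - 1 + block_size) in
  let rec loop kept =
    (* buf.[0 .. kept - 1] holds the tail of the previous block *)
    let n = input ic buf kept block_size in
    if n > 0 then begin
      (* stop at the end of the line if the fresh data contains one *)
      let stop =
        match Bytes.index_from_opt buf kept '\n' with
        | Some i when i < kept + n -> i
        | _ -> kept + n in
      for off = 0 to stop - k do f buf off done;
      if stop = kept + n then begin
        (* no newline yet: carry the last k - 1 bytes over to the next block *)
        let keep = min (k - 1) stop in
        Bytes.blit buf (stop - keep) buf 0 keep;
        loop keep
      end
    end
  in
  loop 0
```

The only copying is one blit of at most k - 1 bytes per block, so the cost per window does not depend on k. Two simplifications to keep in mind: the block that contains the newline is read in full, so the channel is left somewhere after the end of the line, and a window must be copied (for instance with Bytes.sub_string) if it has to outlive the callback.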

-- 
Caml-list mailing list.  Subscription management and archives:
https://sympa-roc.inria.fr/wws/info/caml-list
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs
