2012/3/19 Edgar Friendly <thelema...@gmail.com>

> On 03/19/2012 05:08 AM, Philippe Veber wrote:
>
>> Thanks Edgar and Jérémie, this indeed seems to be the right track. I
>> just hope that a repeated use of input_char is not 10-100X slower than
>> input_line :o).
>> ph.
>
> Quite true - instead of giving the matcher just a single byte at a time,
> it is more efficient to give it blocks of data, as long as it can keep its
> state from one block to the next. But its matching internally will be on
> one byte at a time, normally.
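
A minimal sketch of the block-wise feeding described above, assuming a single literal pattern and only the OCaml standard library. The matcher type and the names create, feed and scan_channel are hypothetical (they are not from Batteries or biocaml), and a KMP automaton stands in for whatever regexp subset is actually needed. The point is only that the matcher state survives from one block to the next, so a match may straddle a block boundary.

```ocaml
(* Hypothetical incremental matcher for one literal pattern (KMP automaton).
   All names here are illustrative, not an existing library API. *)
type matcher = {
  pat : string;
  fail : int array;      (* KMP failure table for [pat] *)
  mutable state : int;   (* number of pattern bytes currently matched *)
  mutable pos : int;     (* absolute offset of the next input byte *)
}

let create pat =
  let m = String.length pat in
  if m = 0 then invalid_arg "create: empty pattern";
  let fail = Array.make m 0 in
  let k = ref 0 in
  for i = 1 to m - 1 do
    while !k > 0 && pat.[i] <> pat.[!k] do k := fail.(!k - 1) done;
    if pat.[i] = pat.[!k] then incr k;
    fail.(i) <- !k
  done;
  { pat; fail; state = 0; pos = 0 }

(* Feed the first [len] bytes of [buf]. Returns the absolute start offsets
   of the matches that end inside this block, in increasing order.
   Internally the matching is still done one byte at a time. *)
let feed t buf len =
  let m = String.length t.pat in
  let hits = ref [] in
  for i = 0 to len - 1 do
    let c = Bytes.get buf i in
    while t.state > 0 && c <> t.pat.[t.state] do
      t.state <- t.fail.(t.state - 1)
    done;
    if c = t.pat.[t.state] then t.state <- t.state + 1;
    if t.state = m then begin
      hits := (t.pos + 1 - m) :: !hits;
      t.state <- t.fail.(m - 1)   (* keep going, overlapping matches allowed *)
    end;
    t.pos <- t.pos + 1
  done;
  List.rev !hits

(* Read [ic] block by block; only one block is in memory at any time. *)
let scan_channel ?(block_size = 65536) pat ic =
  let t = create pat in
  let buf = Bytes.create block_size in
  let rec loop acc =
    match input ic buf 0 block_size with
    | 0 -> List.concat (List.rev acc)   (* end of file *)
    | n -> loop (feed t buf n :: acc)
  in
  loop []
```

For example, feeding "TTAC" and then "GTACGT" to a matcher created with create "ACGT" reports the first match in the second call, although it started in the first block.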

Thanks for the confirmation, I now see more clearly what to do.

> I guess with DNA, because of the reduced character set, it'd be possible
> to get each symbol down to 2 bits (if you're really just using ACGT), in
> which case, the matcher could run 4 basepairs at a time, but there's a lot
> of corner issues doing things that way. A lot depends on how much time and
> effort you're willing to spend engineering something.

Maybe not that far yet, but this is something we've mentioned for biocaml. I
guess I could take some inspiration from the bitset module in Batteries.
Anyway thanks everybody for your help!
ph.

> E.
>
>> 2012/3/16 Edgar Friendly <thelema...@gmail.com>
>>
>>     So given a large file and a line number, you want to:
>>     1) extract that line from the file
>>     2) produce an enum of all k-length slices of that line?
>>     3) match each slice against your regexp set to produce a list/enum
>>        of substrings that match the regexps?
>>     Without reading the whole line into memory at once.
>>
>>     I'm with Dimino on the right solution - just use a matcher that
>>     works incrementally, feed it one byte at a time, and have it return
>>     a list of match offsets. Then work backwards from these endpoints
>>     to figure out which substrings you want.
>>
>>     There shouldn't be a reason to use substrings (0,k-1) and (1,k) - it
>>     should suffice to use (0,k-1) and (k,2k-1) with an incremental
>>     matching routine.
>>
>>     E.
>>
>>     On Fri, Mar 16, 2012 at 10:48 AM, Philippe Veber
>>     <philippe.ve...@gmail.com> wrote:
>>
>>         Thank you Edgar for your answer (and also Christophe). It seems
>>         my question was a bit misleading: actually I target a subset of
>>         regexps whose matching is really trivial, so this is no worry
>>         for me. I was more interested in how to access a large line in
>>         a file by chunks of fixed length k.
>>         For instance, how to build a
>>         [Substring.t Enum.t] from some line in a file, without building
>>         the whole line in memory. This enum would yield the substrings
>>         (0,k-1), (1,k), (2,k+1), etc ... without doing too many string
>>         copy/concat operations. I think I can do it myself but I'm not
>>         too confident regarding good practices on buffered reads of
>>         files. Maybe there are some good examples in Batteries?
>>
>>         Thanks again,
>>         ph.

-- 
Caml-list mailing list. Subscription management and archives:
https://sympa-roc.inria.fr/wws/info/caml-list
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs
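
A rough sketch of the original question (visiting the windows (0,k-1), (1,k), (2,k+1), ... of one line without building the line in memory), using only the standard library instead of Batteries' Substring and Enum. The names iter_windows and windows_as_strings are hypothetical. It assumes the channel is already positioned at the start of the wanted line, and it may consume bytes beyond the end of that line. Each window is handed to the callback as an offset into a shared buffer, so no string is copied unless the callback asks for it.

```ocaml
(* Call [f buf off abs] for every window of [k] bytes on the current line.
   The window is buf[off .. off+k-1] and is only valid during the call;
   [abs] is the offset of the window within the line.
   Hypothetical helper, not an existing library function. *)
let iter_windows ?(block_size = 65536) k ic f =
  if k <= 0 || block_size <= 0 then invalid_arg "iter_windows";
  let buf = Bytes.create (k - 1 + block_size) in
  (* [carry]: bytes kept at the front of [buf] from earlier blocks (< k);
     [base]: offset within the line of the byte stored at buf[0]. *)
  let rec loop carry base =
    let n = input ic buf carry block_size in
    if n > 0 then begin
      let avail = carry + n in
      (* Truncate at the end of the line if it falls in the new bytes. *)
      let avail, eol =
        match Bytes.index_from_opt buf carry '\n' with
        | Some i when i < avail -> (i, true)
        | _ -> (avail, false)
      in
      for off = 0 to avail - k do
        f buf off (base + off)
      done;
      if not eol then begin
        (* Keep the last k-1 bytes: they start the windows not yet complete. *)
        let keep = min avail (k - 1) in
        Bytes.blit buf (avail - keep) buf 0 keep;
        loop keep (base + avail - keep)
      end
    end
  in
  loop 0 0

(* Convenience for small inputs: copy every window into a fresh string. *)
let windows_as_strings ?block_size k ic =
  let acc = ref [] in
  iter_windows ?block_size k ic
    (fun buf off _ -> acc := Bytes.sub_string buf off k :: !acc);
  List.rev !acc
```

Memory use is bounded by block_size + k - 1 bytes whatever the line length. As Edgar notes, with an incremental matcher the overlapping windows are not needed at all; this only matters if each k-length slice really has to be inspected separately.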