* Felipe Almeida Lessa <felipe.le...@gmail.com> [30.11.2011 01:40]:
> On Tue, Nov 29, 2011 at 10:32 PM, Christian Höner zu Siederdissen
> <choe...@tbi.univie.ac.at> wrote:
> > how much interest is there for iteratee-based fasta reading? Has someone
> > already written something?
>
> I don't know. While it would be nice, currently that's not something
> that I need myself.
>
> > Since iteratee- (or enumerator-based parsing in general) is strict in
> > its output, there are some considerations regarding large files. On the
> > other hand, sometime in early 2012 I'll probably provide a library to
> > efficiently handle tasks on large sequence-based files.
>
> What do you mean by "strict in its output"? Do you mean that each
> sequence of the FASTA file would need to be held in memory?
>
> I guess there are two different FASTA readers possible, depending on
> if the stream is based on (just examples)
>
>   data FastaSeq = FastaSeq SeqLabel SeqData
>
> or
>
>   data FastaItem = FastaLabel SeqLabel | FastaData SeqData
>
> Using FastaSeq you get a simple-to-use interface that needs to hold
> each sequence in memory. Using FastaItem you get something like a SAX
> parser where the stream may be consumed in constant memory usage
> (something like [FastaLabel ..., FastaData ..., FastaData ...,
> FastaData ...] where each data chunk is of a limited size), but where
> it's a little bit more difficult to write programs.
>
> Assuming that we wrote some FASTA parser using enumeratees, I guess
> FastaItem is the way to go, since it's possible to have an enumeratee
> that converts FastaItems into FastaSeqs.
>

Well, using iteratee

  run =<< (enumFile 8192 filename $ iterateeFasta)

would give you a list [FastaSeq] that is completely in memory. Of
course, that is stupid if you want to handle the human genome... So
you'd put an enumeratee somewhere in there that does the real work.
With that in place, the iteratee above could require as little as 8k of
memory to work, plus only as much as the set of data we want to
extract. That is mostly your SAX example.

I still think lazy file reading is awesome, but I have bumped into the
same problems that led to iteratees (mainly that I opened too many
files at the same time) often enough to really appreciate them now.

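To make the FastaItem/FastaSeq distinction concrete, here is a minimal sketch that uses plain lazy lists instead of the iteratee library, so it stays self-contained. It is not the code mentioned in this thread. SeqLabel and SeqData are assumed to be String aliases, and fastaItems, fastaSeqs, totalResidues and chunksOf are hypothetical names chosen for illustration.

```haskell
import Data.Char (isSpace)

-- Assumption: the thread does not define these; String keeps the sketch simple.
type SeqLabel = String
type SeqData  = String

-- SAX-like event stream: one label, then any number of bounded data chunks.
data FastaItem = FastaLabel SeqLabel | FastaData SeqData
  deriving (Eq, Show)

-- Whole record: the complete sequence is held in memory.
data FastaSeq = FastaSeq SeqLabel SeqData
  deriving (Eq, Show)

-- Split a list into pieces of at most n elements (n is clamped to >= 1).
chunksOf :: Int -> [a] -> [[a]]
chunksOf n xs
  | null xs   = []
  | otherwise = let (h, t) = splitAt (max 1 n) xs in h : chunksOf n t

-- Turn input lines into events; each data chunk has at most n characters.
fastaItems :: Int -> [String] -> [FastaItem]
fastaItems n = concatMap item
  where
    item ('>' : label) = [FastaLabel label]
    item line          = map FastaData (chunksOf n (filter (not . isSpace) line))

-- Convert events into whole records, the job the enumeratee would do.
-- Data chunks that appear before the first label are dropped.
fastaSeqs :: [FastaItem] -> [FastaSeq]
fastaSeqs items = case dropWhile isData items of
    FastaLabel l : rest ->
      let (chunks, rest') = span isData rest
      in  FastaSeq l (concat [d | FastaData d <- chunks]) : fastaSeqs rest'
    _ -> []
  where
    isData (FastaData _) = True
    isData _             = False

-- An event-level consumer: it only ever looks at one chunk at a time.
totalResidues :: [FastaItem] -> Int
totalResidues items = sum [length d | FastaData d <- items]
```

Loaded into GHCi, `fastaSeqs (fastaItems 8192 (lines contents))` gives the simple interface, while a consumer such as totalResidues works directly on the events and never needs a whole sequence at once. With iteratees the same two layers would be an enumeratee plus an iteratee rather than list functions.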
Anyways, if you are interested, I'll upload the code in a few hours;
it's getting early in Austria and I have to fix up some things first ;-)

Regards,
Christian

> Cheers,
>
> --
> Felipe.
_______________________________________________
Biohaskell mailing list
Biohaskell@biohaskell.org
http://malde.org/cgi-bin/mailman/listinfo/biohaskell