On Tue, Mar 18, 2014 at 7:54 AM, Gabriel Becker <gmbec...@ucdavis.edu> wrote:

> Or going the positive declarative route but arguably more informative:
> skip.to.fasta or fasta.only
>

skip.to.fasta might work. A different algorithm that would work for GFF3
would be skip.to.pragma="##FASTA", which would skip until it hit a line
matching "##FASTA".

> I don't know the GFF format spec, are we guaranteed that there will be
> only one embedded fasta file and that it will be contiguous within the
> file?
>

Yes, it is guaranteed that after a certain point in the file (that pragma),
all data is FASTA formatted.

> If not the skip.to._ terminology would not technically be correct.
>
> ~G
>
>
> On Mon, Mar 17, 2014 at 8:17 PM, Michael Lawrence <
> lawrence.mich...@gene.com> wrote:
>
>> For direct reading of the sequence, the skip.non.fasta idea sounds good.
>> An alternative for the name would be "skip.to.first.record". Up to you.
>>
>> Michael
>>
>>
>> On Mon, Mar 17, 2014 at 5:33 PM, Hervé Pagès <hpa...@fhcrc.org> wrote:
>>
>> > Hi Michael,
>> >
>> > On 03/17/2014 04:15 PM, Michael Lawrence wrote:
>> >
>> >> Hi Herve,
>> >>
>> >> What would be a clean way for rtracklayer to extract the (optional)
>> >> FASTA data embedded in a GFF3 file and parse it as an XStringSet?
>> >> Is there a low-level way to pass in-memory data to the parser in
>> >> Biostrings?
>> >>
>> >
>> > Not that it can be used here, but readDNAStringSet() has the 'skip' arg
>> > which is analogous to the 'skip' arg of read.table(), except that, in
>> > the case of readDNAStringSet(), it needs to be specified as the number
>> > of records (FASTA or FASTQ) to skip before beginning to read in
>> > records. So the assumption is that everything before the first record
>> > to read is valid FASTA (or FASTQ). Which is of course not the case
>> > with those GFF3 files with embedded FASTA data.
>> >
>> > However it would be easy to add another arg, say 'skip.non.fasta.lines',
>> > to automatically skip lines that don't look like the header of a FASTA
>> > record (i.e. that don't start with '>').
>> >
>> >> In terms of the API, import,GFFFile could return a GRanges with the
>> >> DNAStringSet in the metadata(). Or there could be a method for
>> >> readDNAStringSet on GFF3File that returns the DNAStringSet directly.
>> >>
>> >
>> > The readDNAStringSet,GFF3File method seems cleaner than the metadata()
>> > solution. It's also lower-level and would be needed behind the scene by
>> > import,GFFFile, so I think it would make sense to start with it.
>> > Implementing readDNAStringSet,GFF3File will be trivial once we have
>> > something like the 'skip.non.fasta' arg. Should I go for it? Any better
>> > suggestion for the name of this arg?
>> >
>> > Thanks,
>> > H.
>> >
>> >> It turns out this functionality is useful when working with microbial
>> >> genomes, where information tends to be passed around as Genbank files.
>> >> For right now the easiest path seems to be to convert Genbank to GFF,
>> >> but a Genbank parser in Bioc could be an eventual goal. It's a very
>> >> complex file format.
>> >>
>> >> Michael
>> >>
>> >
>> > --
>> > Hervé Pagès
>> >
>> > Program in Computational Biology
>> > Division of Public Health Sciences
>> > Fred Hutchinson Cancer Research Center
>> > 1100 Fairview Ave. N, M1-B514
>> > P.O. Box 19024
>> > Seattle, WA 98109-1024
>> >
>> > E-mail: hpa...@fhcrc.org
>> > Phone: (206) 667-5791
>> > Fax: (206) 667-1319
>> >
>
> --
> Gabriel Becker
> Graduate Student
> Statistics Department
> University of California, Davis

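
The two skipping strategies discussed in this thread (skip until the "##FASTA" pragma, or skip leading lines until the first one that starts with '>') can be sketched as follows. This is only an illustration in Python, not the Biostrings or rtracklayer implementation: the function names skip_to_pragma, skip_non_fasta_lines and parse_fasta are hypothetical, and the toy GFF3 content is made up for the example.

```python
from typing import Iterable, Iterator, List, Tuple


def skip_to_pragma(lines: Iterable[str], pragma: str = "##FASTA") -> Iterator[str]:
    """Yield only the lines that follow the first line equal to `pragma`.

    Hypothetical helper mirroring the skip.to.pragma idea; yields nothing
    if the pragma never appears.
    """
    it = iter(lines)
    for line in it:
        if line.strip() == pragma:
            break
    # The iterator is now positioned just after the pragma line.
    yield from it


def skip_non_fasta_lines(lines: Iterable[str]) -> Iterator[str]:
    """Drop leading lines until the first one that starts with '>'.

    Hypothetical helper mirroring the skip.non.fasta.lines idea.
    """
    it = iter(lines)
    for line in it:
        if line.startswith(">"):
            yield line  # keep the first FASTA header itself
            break
    yield from it


def parse_fasta(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse FASTA lines into (name, sequence) pairs; blank lines are ignored."""
    records: List[Tuple[str, str]] = []
    name = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


# Toy GFF3 content with an embedded FASTA section (made-up data).
gff3 = """##gff-version 3
ctg1\t.\tgene\t1\t8\t.\t+\t.\tID=gene1
##FASTA
>ctg1
ACGT
ACGT
>ctg2
GGCC
""".splitlines()

records = parse_fasta(skip_to_pragma(gff3))
same_records = parse_fasta(skip_non_fasta_lines(gff3))
```

Both strategies give the same records on this input. They differ when a non-FASTA line happens to start with '>' before the pragma, which is one reason the pragma-based skip may be the safer choice for GFF3.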
_______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel