For direct reading of the sequence, the skip.non.fasta idea sounds good. An alternative for the name would be "skip.to.first.record". Up to you.

Michael

On Mon, Mar 17, 2014 at 5:33 PM, Hervé Pagès <hpa...@fhcrc.org> wrote:

> Hi Michael,
>
> On 03/17/2014 04:15 PM, Michael Lawrence wrote:
>
>> Hi Herve,
>>
>> What would be a clean way for rtracklayer to extract the (optional) FASTA
>> data embedded in a GFF3 file and parse it as an XStringSet? Is there a
>> low-level way to pass in-memory data to the parser in Biostrings?
>
> Not that it can be used here, but readDNAStringSet() has the 'skip' arg
> which is analogous to the 'skip' arg of read.table(), except that, in
> the case of readDNAStringSet(), it needs to be specified as the number
> of records (FASTA or FASTQ) to skip before beginning to read in
> records. So the assumption is that everything before the first record
> to read is valid FASTA (or FASTQ). Which is of course not the case
> with those GFF3 files with embedded FASTA data.
>
> However it would be easy to add another arg, say 'skip.non.fasta.lines',
> to automatically skip lines that don't look like the header of a FASTA
> record (i.e. that don't start with '>').
>
>> In terms of the API, import,GFFFile could return a GRanges with the
>> DNAStringSet in the metadata(). Or there could be a method for
>> readDNAStringSet on GFF3File that returns the DNAStringSet directly.
>
> The readDNAStringSet,GFF3File method seems cleaner than the metadata()
> solution. It's also lower-level and would be needed behind the scene by
> import,GFFFile, so I think it would make sense to start with it.
> Implementing readDNAStringSet,GFF3File will be trivial once we have
> something like the 'skip.non.fasta' arg. Should I go for it? Any better
> suggestion for the name of this arg?
>
> Thanks,
> H.
>
>> It turns out this functionality is useful when working with microbial
>> genomes, where information tends to be passed around as Genbank files. For
>> right now the easiest path seems to be to convert Genbank to GFF, but a
>> Genbank parser in Bioc could be an eventual goal. It's a very complex file
>> format.
>>
>> Michael
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpa...@fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
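
For illustration only, here is a minimal base-R sketch of the "skip to first record" behaviour discussed in this thread. It is not code from Biostrings or rtracklayer: the function name `read_embedded_fasta` is hypothetical, and it assumes the file has already been read into a character vector of lines (for example with `readLines()`) and that no GFF3 feature line starts with '>'.

```r
# Hypothetical helper, not part of Biostrings or rtracklayer.
# Skips every line before the first FASTA header (a line starting with '>')
# and returns the embedded sequences as a named character vector.
read_embedded_fasta <- function(lines) {
  is_header <- startsWith(lines, ">")
  if (!any(is_header)) {
    return(character(0))  # no embedded FASTA section
  }

  # Drop everything before the first FASTA header.
  first <- which(is_header)[1]
  lines <- lines[first:length(lines)]
  is_header <- is_header[first:length(is_header)]

  # Ignore blank lines inside the FASTA section.
  keep <- is_header | nzchar(trimws(lines))
  lines <- lines[keep]
  is_header <- is_header[keep]

  # Assign each sequence line to the record opened by the latest header.
  record_id <- cumsum(is_header)
  seq_lines <- split(
    lines[!is_header],
    factor(record_id[!is_header], levels = seq_len(sum(is_header)))
  )

  # Concatenate the lines of each record into a single string.
  seqs <- vapply(seq_lines, paste, character(1), collapse = "")
  names(seqs) <- sub("^>", "", lines[is_header])
  seqs
}

# Toy GFF3 content with an embedded FASTA section (made-up data).
gff3 <- c(
  "##gff-version 3",
  "ctg123\t.\tgene\t1\t8\t.\t+\t.\tID=gene1",
  "##FASTA",
  ">ctg123",
  "ACGT",
  "TTGA",
  ">ctg456",
  "GGCC"
)
seqs <- read_embedded_fasta(gff3)
```

The resulting named character vector could then presumably be handed to the `DNAStringSet()` constructor in Biostrings. A real `readDNAStringSet,GFF3File` method would more likely do the skipping inside the existing parser rather than in R code like this.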

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel