Re: [Bioc-devel] parsing embedded FASTA data

Gabriel Becker Tue, 18 Mar 2014 07:55:40 -0700

Or going the positive declarative route but arguably more informative:
skip.to.fasta or fasta.only


I don't know the GFF format spec. Are we guaranteed that there will be only
one embedded FASTA file and that it will be contiguous within the file? If
not, the skip.to._ terminology would not technically be correct.
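
To make the contiguity question concrete, here is a rough sketch of what a
"skip to the first record" reader would do. It is written in Python purely to
pin down the logic; it is not Biostrings or rtracklayer code, and the name
read_embedded_fasta is made up for illustration. It assumes exactly one
contiguous FASTA block that runs to the end of the file, which is the very
assumption being asked about.

```python
from typing import Dict, Iterable


def read_embedded_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Skip every line before the first FASTA header, then parse records.

    Hypothetical helper. Assumes a single contiguous FASTA block that
    extends to the end of the input.
    """
    records: Dict[str, str] = {}
    name = ""
    in_fasta = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not in_fasta:
            # Before the first header: drop anything not starting with '>'.
            # This also drops a '##FASTA' directive line, if present.
            if not line.startswith(">"):
                continue
            in_fasta = True
        if line.startswith(">"):
            # Header line: use the full text after '>' as the record name.
            name = line[1:].strip()
            records[name] = ""
        elif line.strip():
            # After the first header, every non-empty line is treated as
            # sequence. If the FASTA block were NOT contiguous, stray GFF
            # lines would be appended here as if they were sequence.
            records[name] += line.strip()
    return records


gff3 = (
    "##gff-version 3\n"
    "ctg1\t.\tgene\t1\t8\t.\t+\t.\tID=gene1\n"
    "##FASTA\n"
    ">ctg1\n"
    "ACGT\n"
    "ACGT\n"
    ">ctg2 second contig\n"
    "TTGA\n"
)
parsed = read_embedded_fasta(gff3.splitlines())
print(parsed)
```

If feature lines could reappear after the sequence data, a reader like this
would silently fold them into the last record, so "skip to" would be the wrong
mental model and something closer to "FASTA only" would be needed.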

~G


On Mon, Mar 17, 2014 at 8:17 PM, Michael Lawrence <lawrence.mich...@gene.com> wrote:

> For direct reading of the sequence, the skip.non.fasta idea sounds good. An
> alternative for the name would be "skip.to.first.record". Up to you.
>
> Michael
>
>
> On Mon, Mar 17, 2014 at 5:33 PM, Hervé Pagès <hpa...@fhcrc.org> wrote:
>
> > Hi Michael,
> >
> >
> > On 03/17/2014 04:15 PM, Michael Lawrence wrote:
> >
> >> Hi Herve,
> >>
> >> What would be a clean way for rtracklayer to extract the (optional) FASTA
> >> data embedded in a GFF3 file and parse it as an XStringSet? Is there a
> >> low-level way to pass in-memory data to the parser in Biostrings?
> >>
> >
> > Not that it can be used here, but readDNAStringSet() has the 'skip' arg
> > which is analogous to the 'skip' arg of read.table(), except that, in
> > the case of readDNAStringSet(), it needs to be specified as the number
> > of records (FASTA or FASTQ) to skip before beginning to read in
> > records. So the assumption is that everything before the first record
> > to read is valid FASTA (or FASTQ). Which is of course not the case
> > with those GFF3 files with embedded FASTA data.
> >
> > However it would be easy to add another arg, say 'skip.non.fasta.lines',
> > to automatically skip lines that don't look like the header of a FASTA
> > record (i.e. that don't start with '>').
> >
> >
> >
> >> In terms of the API, import,GFFFile could return a GRanges with the
> >> DNAStringSet in the metadata(). Or there could be a method for
> >> readDNAStringSet on GFF3File that returns the DNAStringSet directly.
> >>
> >
> > The readDNAStringSet,GFF3File method seems cleaner than the metadata()
> > solution. It's also lower-level and would be needed behind the scenes by
> > import,GFFFile, so I think it would make sense to start with it.
> > Implementing readDNAStringSet,GFF3File will be trivial once we have
> > something like the 'skip.non.fasta' arg. Should I go for it? Any better
> > suggestion for the name of this arg?
> >
> > Thanks,
> > H.
> >
> >
> >> It turns out this functionality is useful when working with microbial
> >> genomes, where information tends to be passed around as Genbank files. For
> >> right now the easiest path seems to be to convert Genbank to GFF, but a
> >> Genbank parser in Bioc could be an eventual goal. It's a very complex file
> >> format.
> >>
> >> Michael
> >>
> >> _______________________________________________
> >> Bioc-devel@r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> >>
> >>
> > --
> > Hervé Pagès
> >
> > Program in Computational Biology
> > Division of Public Health Sciences
> > Fred Hutchinson Cancer Research Center
> > 1100 Fairview Ave. N, M1-B514
> > P.O. Box 19024
> > Seattle, WA 98109-1024
> >
> > E-mail: hpa...@fhcrc.org
> > Phone:  (206) 667-5791
> > Fax:    (206) 667-1319
> >
>


-- 
Gabriel Becker
Graduate Student
Statistics Department
University of California, Davis
