Re: [Bioc-devel] parsing embedded FASTA data

Hervé Pagès Mon, 24 Mar 2014 10:09:41 -0700

Hi Michael,

This is now supported in Biostrings 2.31.17 (see the 'seek.first.rec'
argument discussed below).


Cheers,
H.


On 03/18/2014 11:42 AM, Hervé Pagès wrote:

Hi,

On 03/18/2014 10:04 AM, Michael Lawrence wrote:




On Tue, Mar 18, 2014 at 7:54 AM, Gabriel Becker <gmbec...@ucdavis.edu> wrote:

    Or going the positive declarative route but arguably more
    informative: skip.to.fasta or fasta.only


skip.to.fasta might work. A different algorithm that would work for GFF3
would be skip.to.pragma="##FASTA", which would skip until it hit a line
matching "##FASTA".


    I don't know the GFF format spec. Are we guaranteed that there will
    be only one embedded FASTA file and that it will be contiguous
    within the file?


Yes, it is guaranteed that after a certain point in the file (that
pragma), all data is FASTA formatted.
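
For illustration only, the skip-to-pragma idea above can be sketched in a few
lines of Python. This is not rtracklayer or Biostrings code: the function name
read_embedded_fasta and its 'pragma' parameter are hypothetical, and the sketch
assumes that every line after the "##FASTA" line is FASTA formatted.

```python
def read_embedded_fasta(lines, pragma="##FASTA"):
    """Return {record name: sequence} for the FASTA data that follows `pragma`.

    `lines` is any iterable of text lines (an open file, a list of strings).
    Hypothetical helper, for illustration only.
    """
    records = {}
    name = None
    in_fasta = False
    for raw in lines:
        line = raw.strip()
        if not in_fasta:
            # Everything before the pragma is GFF3 and is ignored.
            in_fasta = (line == pragma)
            continue
        if line.startswith(">"):
            # Header line: start a new record.
            name = line[1:].strip()
            records[name] = []
        elif name is not None and line:
            # Sequence line belonging to the current record.
            records[name].append(line)
    return {key: "".join(chunks) for key, chunks in records.items()}
```

Passing an open GFF3 file handle as 'lines' would return only the embedded
sequences; a file without the pragma yields an empty dict.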


Thanks for the suggestions. I think I'll go for 'seek.first.rec', just
to keep it generic and not tied to the specifics of the GFF, FASTA, or
FASTQ formats.
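
As a rough sketch of what "seeking the first record" could mean, independent
of GFF: scan forward until a line looks like the start of a record. The
function below is hypothetical Python, not the Biostrings implementation. It
assumes a record starts with '>' (FASTA) or '@' (FASTQ) and that no earlier
line begins with one of those characters.

```python
def seek_first_record(lines, markers=(">", "@")):
    """Return the index of the first line that starts a record, or None.

    `markers` holds the characters assumed to open a record:
    '>' for FASTA and '@' for FASTQ. Hypothetical helper.
    """
    for i, line in enumerate(lines):
        if line.startswith(markers):  # str.startswith accepts a tuple
            return i
    return None
```

With the lines of a GFF3 file held in a list, lines[seek_first_record(lines):]
would be the part to hand to a FASTA parser, provided the result is not None.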

H.


    If not, the skip.to._ terminology would not technically be correct.

    ~G


    On Mon, Mar 17, 2014 at 8:17 PM, Michael Lawrence
    <lawrence.mich...@gene.com> wrote:

        For direct reading of the sequence, the skip.non.fasta idea
        sounds good. An alternative for the name would be
        "skip.to.first.record". Up to you.

        Michael


        On Mon, Mar 17, 2014 at 5:33 PM, Hervé Pagès <hpa...@fhcrc.org> wrote:

         > Hi Michael,
         >
         >
         > On 03/17/2014 04:15 PM, Michael Lawrence wrote:
         >
         >> Hi Herve,
         >>
         >> What would be a clean way for rtracklayer to extract the
         >> (optional) FASTA data embedded in a GFF3 file and parse it as an
         >> XStringSet? Is there a low-level way to pass in-memory data to the
         >> parser in Biostrings?
         >>
         >
         > Not that it can be used here, but readDNAStringSet() has the
         > 'skip' arg which is analogous to the 'skip' arg of read.table(),
         > except that, in the case of readDNAStringSet(), it needs to be
         > specified as the number of records (FASTA or FASTQ) to skip before
         > beginning to read in records. So the assumption is that everything
         > before the first record to read is valid FASTA (or FASTQ). Which is
         > of course not the case with those GFF3 files with embedded FASTA
         > data.
         >
         > However it would be easy to add another arg, say
         > 'skip.non.fasta.lines', to automatically skip lines that don't look
         > like the header of a FASTA record (i.e. that don't start with '>').
         >
         >
         >
         >> In terms of the API, import,GFFFile could return a GRanges
         >> with the DNAStringSet in the metadata(). Or there could be a method
         >> for readDNAStringSet on GFF3File that returns the DNAStringSet
         >> directly.
         >>
         >
         > The readDNAStringSet,GFF3File method seems cleaner than the
         > metadata() solution. It's also lower-level and would be needed
         > behind the scenes by import,GFFFile, so I think it would make sense
         > to start with it. Implementing readDNAStringSet,GFF3File will be
         > trivial once we have something like the 'skip.non.fasta' arg.
         > Should I go for it? Any better suggestion for the name of this arg?
         >
         > Thanks,
         > H.
         >
         >
         >> It turns out this functionality is useful when working with
         >> microbial genomes, where information tends to be passed around as
         >> Genbank files. For right now the easiest path seems to be to
         >> convert Genbank to GFF, but a Genbank parser in Bioc could be an
         >> eventual goal. It's a very complex file format.
         >>
         >> Michael
         >>
         >> _______________________________________________
         >> Bioc-devel@r-project.org mailing list
         >> https://stat.ethz.ch/mailman/listinfo/bioc-devel
         >>
         >>
         > --
         > Hervé Pagès
         >
         > Program in Computational Biology
         > Division of Public Health Sciences
         > Fred Hutchinson Cancer Research Center
         > 1100 Fairview Ave. N, M1-B514
         > P.O. Box 19024
         > Seattle, WA 98109-1024
         >
         > E-mail: hpa...@fhcrc.org
         > Phone: (206) 667-5791
         > Fax: (206) 667-1319
         >





    --
    Gabriel Becker
    Graduate Student
    Statistics Department
    University of California, Davis



