Re: [Bioc-devel] parsing embedded FASTA data

Michael Lawrence Tue, 18 Mar 2014 10:05:47 -0700

On Tue, Mar 18, 2014 at 7:54 AM, Gabriel Becker <gmbec...@ucdavis.edu> wrote:


> Or going the positive declarative route but arguably more informative:
> skip.to.fasta or fasta.only
>
>
skip.to.fasta might work. A different algorithm that would work for GFF3
would be skip.to.pragma="##FASTA", which would skip until it hit a line
matching "##FASTA".
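
A minimal sketch of the skip-to-pragma idea, written in Python purely for illustration. The helper name `skip_to_pragma` and its signature are hypothetical and are not part of Biostrings or rtracklayer; the sketch only assumes that the pragma sits on a line by itself.

```python
import io
from typing import Iterable, Iterator


def skip_to_pragma(lines: Iterable[str], pragma: str = "##FASTA") -> Iterator[str]:
    """Yield the lines that follow the first line equal to `pragma`.

    Hypothetical helper: everything up to and including the pragma line is
    discarded. Nothing is yielded if the pragma never appears.
    """
    it = iter(lines)
    for line in it:
        if line.rstrip("\r\n") == pragma:
            break  # found the pragma; the rest of `it` is the payload
    else:
        return  # reached end of input without seeing the pragma
    for line in it:
        yield line.rstrip("\r\n")


# Tiny made-up GFF3 document with an embedded FASTA section.
gff3 = io.StringIO(
    "##gff-version 3\n"
    "chr1\t.\tgene\t1\t8\t.\t+\t.\tID=gene1\n"
    "##FASTA\n"
    ">chr1\n"
    "ACGTACGT\n"
)
fasta_lines = list(skip_to_pragma(gff3))
```

The remaining lines could then be handed to an ordinary FASTA parser unchanged, which is the appeal of this approach over filtering line by line.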



> I don't know the GFF format spec, are we guaranteed that there will be
> only one embedded fasta file and that it will be contiguous within the
> file?
>

Yes, it is guaranteed that after a certain point in the file (that pragma),
all data is FASTA formatted.
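
Because of that guarantee, a reader can switch into plain FASTA mode once it sees the pragma and never needs to switch back. A rough Python sketch under that assumption follows; `read_embedded_fasta` is a hypothetical name used for illustration, not the Biostrings parser, and it returns a plain dict rather than a DNAStringSet.

```python
from typing import Dict, Iterable, Optional


def read_embedded_fasta(lines: Iterable[str], pragma: str = "##FASTA") -> Dict[str, str]:
    """Collect the FASTA records that follow `pragma` into {name: sequence}.

    Hypothetical helper: assumes every line after the pragma is FASTA.
    """
    records: Dict[str, str] = {}
    in_fasta = False
    name: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not in_fasta:
            # Still in the annotation part; only look for the pragma.
            in_fasta = line == pragma
            continue
        if not line:
            continue  # tolerate blank lines inside the FASTA section
        if line.startswith(">"):
            # Record name is the first whitespace-delimited token of the header.
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = ""
        elif name is not None:
            records[name] += line  # sequence may span several lines
    return records


# Made-up example input.
example = [
    "##gff-version 3\n",
    "chr1\t.\tgene\t1\t8\t.\t+\t.\tID=gene1\n",
    "##FASTA\n",
    ">chr1 first contig\n",
    "ACGT\n",
    "ACGT\n",
    ">chr2\n",
    "GG\n",
]
seqs = read_embedded_fasta(example)
```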


> If not the skip.to._ terminology would not technically be correct.
>
> ~G
>
>
> On Mon, Mar 17, 2014 at 8:17 PM, Michael Lawrence <
> lawrence.mich...@gene.com> wrote:
>
>> For direct reading of the sequence, the skip.non.fasta idea sounds good. An
>> alternative for the name would be "skip.to.first.record". Up to you.
>>
>> Michael
>>
>>
>> On Mon, Mar 17, 2014 at 5:33 PM, Hervé Pagès <hpa...@fhcrc.org> wrote:
>>
>> > Hi Michael,
>> >
>> >
>> > On 03/17/2014 04:15 PM, Michael Lawrence wrote:
>> >
>> >> Hi Herve,
>> >>
>> >> What would be a clean way for rtracklayer to extract the (optional) FASTA
>> >> data embedded in a GFF3 file and parse it as an XStringSet? Is there a
>> >> low-level way to pass in-memory data to the parser in Biostrings?
>> >>
>> >
>> > Not that it can be used here, but readDNAStringSet() has the 'skip' arg
>> > which is analogous to the 'skip' arg of read.table(), except that, in
>> > the case of readDNAStringSet(), it needs to be specified as the number
>> > of records (FASTA or FASTQ) to skip before beginning to read in
>> > records. So the assumption is that everything before the first record
>> > to read is valid FASTA (or FASTQ). Which is of course not the case
>> > with those GFF3 files with embedded FASTA data.
>> >
>> > However it would be easy to add another arg, say 'skip.non.fasta.lines',
>> > to automatically skip lines that don't look like the header of a FASTA
>> > record (i.e. that don't start with '>').
>> >
>> >
>> >
>> >> In terms of the API, import,GFFFile could return a GRanges with the
>> >> DNAStringSet in the metadata(). Or there could be a method for
>> >> readDNAStringSet on GFF3File that returns the DNAStringSet directly.
>> >>
>> >
>> > The readDNAStringSet,GFF3File method seems cleaner than the metadata()
>> > solution. It's also lower-level and would be needed behind the scenes by
>> > import,GFFFile, so I think it would make sense to start with it.
>> > Implementing readDNAStringSet,GFF3File will be trivial once we have
>> > something like the 'skip.non.fasta' arg. Should I go for it? Any better
>> > suggestion for the name of this arg?
>> >
>> > Thanks,
>> > H.
>> >
>> >
>> >> It turns out this functionality is useful when working with microbial
>> >> genomes, where information tends to be passed around as Genbank files. For
>> >> right now the easiest path seems to be to convert Genbank to GFF, but a
>> >> Genbank parser in Bioc could be an eventual goal. It's a very complex file
>> >> format.
>> >>
>> >> Michael
>> >>
>> >>         [[alternative HTML version deleted]]
>> >>
>> >> _______________________________________________
>> >> Bioc-devel@r-project.org mailing list
>> >> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>> >>
>> >>
>> > --
>> > Hervé Pagès
>> >
>> > Program in Computational Biology
>> > Division of Public Health Sciences
>> > Fred Hutchinson Cancer Research Center
>> > 1100 Fairview Ave. N, M1-B514
>> > P.O. Box 19024
>> > Seattle, WA 98109-1024
>> >
>> > E-mail: hpa...@fhcrc.org
>> > Phone:  (206) 667-5791
>> > Fax:    (206) 667-1319
>> >
>>
>
>
> --
> Gabriel Becker
> Graduate Student
> Statistics Department
> University of California, Davis
>
