Re: [Bioc-devel] parsing embedded FASTA data

Michael Lawrence Mon, 17 Mar 2014 20:23:20 -0700

For direct reading of the sequence, the skip.non.fasta idea sounds good. An
alternative for the name would be "skip.to.first.record". Up to you.
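
A minimal sketch of what "skip to the first record" could mean in practice, written in Python so that it does not suggest any actual Biostrings or rtracklayer API. The function name read_fasta_skipping_preamble and the sample GFF3 text are hypothetical, made up for illustration. The assumption is that everything before the first line starting with '>' is non-FASTA content (GFF3 feature lines, the ##FASTA directive) and can be ignored.

```python
from typing import Dict, Iterable, List, Optional


def read_fasta_skipping_preamble(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA records, ignoring every line before the first '>' header.

    Hypothetical helper for illustration only; not the Biostrings parser.
    If two records share a name, the later one overwrites the earlier one.
    """
    records: Dict[str, str] = {}
    name: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new header: store the previous record, if there was one.
            if name is not None:
                records[name] = "".join(chunks)
            name = line[1:].strip()
            chunks = []
        elif name is not None:
            # Sequence line belonging to the current record.
            chunks.append(line.strip())
        # Otherwise we are still in the non-FASTA preamble: skip the line.
    if name is not None:
        records[name] = "".join(chunks)
    return records


# Made-up GFF3 content with an embedded FASTA section.
gff3_text = (
    "##gff-version 3\n"
    "ctg1\t.\tgene\t1\t8\t.\t+\t.\tID=gene1\n"
    "##FASTA\n"
    ">ctg1\n"
    "ACGTAC\n"
    "GT\n"
    ">ctg2 second contig\n"
    "TTGA\n"
)

seqs = read_fasta_skipping_preamble(gff3_text.splitlines())
```

Unlike a 'skip' count given in records, this needs no prior knowledge of how many non-FASTA lines precede the sequences.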


Michael


On Mon, Mar 17, 2014 at 5:33 PM, Hervé Pagès <hpa...@fhcrc.org> wrote:

> Hi Michael,
>
>
> On 03/17/2014 04:15 PM, Michael Lawrence wrote:
>
>> Hi Herve,
>>
>> What would be a clean way for rtracklayer to extract the (optional) FASTA
>> data embedded in a GFF3 file and parse it as an XStringSet? Is there a
>> low-level way to pass in-memory data to the parser in Biostrings?
>>
>
> Not that it can be used here, but readDNAStringSet() has the 'skip' arg
> which is analogous to the 'skip' arg of read.table(), except that, in
> the case of readDNAStringSet(), it needs to be specified as the number
> of records (FASTA or FASTQ) to skip before beginning to read in
> records. So the assumption is that everything before the first record
> to read is valid FASTA (or FASTQ). Which is of course not the case
> with those GFF3 files with embedded FASTA data.
>
> However it would be easy to add another arg, say 'skip.non.fasta.lines',
> to automatically skip the leading lines until the first one that looks
> like the header of a FASTA record (i.e. that starts with '>').
>
>
>
>> In terms of the API, import,GFFFile could return a GRanges with the
>> DNAStringSet in the metadata(). Or there could be a method for
>> readDNAStringSet on GFF3File that returns the DNAStringSet directly.
>>
>
> The readDNAStringSet,GFF3File method seems cleaner than the metadata()
> solution. It's also lower-level and would be needed behind the scenes by
> import,GFFFile, so I think it would make sense to start with it.
> Implementing readDNAStringSet,GFF3File will be trivial once we have
> something like the 'skip.non.fasta' arg. Should I go for it? Any better
> suggestion for the name of this arg?
>
> Thanks,
> H.
>
>
>> It turns out this functionality is useful when working with microbial
>> genomes, where information tends to be passed around as GenBank files. For
>> right now the easiest path seems to be to convert GenBank to GFF, but a
>> GenBank parser in Bioc could be an eventual goal. It's a very complex file
>> format.
>>
>> Michael
>>
>> _______________________________________________
>> Bioc-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpa...@fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>
