2009/10/3 Leo Goodstadt <[email protected]>:
>> >> Is there a way to quickly extract out the coordinates from a gff file
>> >> and the corresponding sequence from a fasta file?
>
> This seems of such general use that it begs a small utility which will
> take a (possibly indexed) fasta file, a gff and output the sequences you
> want. What would people want from such a programme?

At least one user wants the following: given a GFF file, produce a
multi Fasta sequence file with one sequence from each 'feature' in the
GFF file. Each feature sequence should be derived from the corresponding
reference sequence. Features should probably be restricted to certain
types, as zero length or single base features are probably not that
interesting.

> Is GTF (http://mblab.wustl.edu/GTF2.html) more useful or GFF?
> Would different elements from the same group (gene/transcript) be joined
> together in order?

I don't think so. I think GTF was invented to overcome some limitations
with GFF2. However, GFF3 is now the standard:

http://www.sequenceontology.org/gff3.shtml

(I can't believe how incredibly annoying the background image for that
page is!)

As far as I know there are no pending 'improvements' to GFF3.

> Would one want filtering on the "features" column so one could retrieve all
> splice sites or codon exons?

That would be a nice feature, and would be easy to implement.

> What would be the output? Another fasta file? How would each "group" of
> Sequences (e.g. transcript) be labelled? By a user supplied regular
> expression?

I think the required output is a multi Fasta file. The GFF3 format
requires each feature to have a unique ID, so I'd suggest simply using
that as the sequence ID (no point re-inventing the wheel). You could
then include the feature name (if present) and the reference sequence id
in the remainder of the Fasta def line
(http://en.wikipedia.org/wiki/FASTA_format).

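To make the suggestion concrete, here is a minimal sketch in Python of
what such a tool might look like. It is only an illustration under
stated assumptions, not a finished utility: the function names
(read_fasta, parse_attributes, feature_records), the "Name=" / "ref="
layout of the def line, and the default minimum length of 2 are all my
own choices. It also assumes the reference sequences fit in memory, and
it falls back to a coordinate-based identifier when a feature carries no
ID attribute.

```python
from urllib.parse import unquote

# Complement table for reverse-strand features (assumes plain ACGTN input).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def read_fasta(lines):
    """Return {sequence_id: sequence} from an iterable of FASTA lines."""
    sequences, seq_id, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if seq_id is not None:
                sequences[seq_id] = "".join(chunks)
            fields = line[1:].split()
            seq_id, chunks = (fields[0] if fields else ""), []
        elif line and seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        sequences[seq_id] = "".join(chunks)
    return sequences


def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = unquote(value.strip())
    return attrs


def feature_records(gff_lines, sequences, types=None, min_length=2):
    """Yield (def_line, sequence) for each GFF3 feature that passes the filters.

    types      -- optional set of feature types (column 3) to keep
    min_length -- skip features shorter than this (hypothetical default)
    """
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # an embedded FASTA section follows; stop reading features
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        seqid, _source, ftype, start, end, _score, strand, _phase, attr_text = cols
        if types is not None and ftype not in types:
            continue
        start, end = int(start), int(end)
        if end - start + 1 < min_length or seqid not in sequences:
            continue
        # GFF3 coordinates are 1-based and inclusive.
        seq = sequences[seqid][start - 1:end]
        if strand == "-":
            seq = seq.translate(_COMPLEMENT)[::-1]
        attrs = parse_attributes(attr_text)
        # Fallback identifier is an assumption for features without an ID.
        feature_id = attrs.get("ID", f"{seqid}:{start}-{end}({strand})")
        parts = [feature_id]
        if "Name" in attrs:
            parts.append(f"Name={attrs['Name']}")
        parts.append(f"ref={seqid}")
        yield ">" + " ".join(parts), seq


if __name__ == "__main__":
    fasta = [">chr1", "ACGTACGTAC", "GGGTTTAAAC"]
    gff = [
        "##gff-version 3",
        "chr1\tdemo\tgene\t1\t8\t.\t+\t.\tID=gene1;Name=abc",
        "chr1\tdemo\texon\t2\t5\t.\t-\t.\tID=exon1",
        "chr1\tdemo\tSNP\t5\t5\t.\t+\t.\tID=snp1",
    ]
    for def_line, seq in feature_records(gff, read_fasta(fasta)):
        print(def_line)
        print(seq)
```

Passing types={"exon"} would restrict the output to exons, which covers
the filtering on the "features" column asked about above. Joining the
elements of one gene or transcript in order is deliberately left out of
this sketch.
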
>> I guess it depends what you mean by quick- quick to write you could use awk
>> but then it depends what additional things you want to do with results.
>> I ended up writing a C++ fasta utility program since PERL can slow down
>> sometimes but I ended up grabbing a couple of regex libraries to let me
>> grep names etc.
>
> I hoped you used boost:regex which will be in the next c++ standard
> (http://www.boost.org/doc/libs/1_40_0/libs/regex/doc/html/index.html) and
> is as easy to use and powerful as perl/python regular expressions (though
> c rules on escaping backslashes are a pain).
> Leo
> Leo Goodstadt

One thing to consider: if the reference sequence isn't part of the GFF
file and/or isn't passed as a separate Fasta file, the DAS registry
could be queried in order to obtain the URI of a reference server that
provides the sequence:

http://www.dasregistry.org/

That takes the project one step beyond a simple parser, so it's
something to think about rather than an explicit requirement.

Otherwise I think you're right, a little tool to do what you suggested
could be very useful!

Cheers,
Dan.

> _______________________________________________
> BBB mailing list
> [email protected]
> http://www.bioinformatics.org/mailman/listinfo/bbb
