I just implemented this functionality in Galaxy's 'Extract Genomic DNA' tool.
It will be available on our main server in the next couple of
weeks and is available now via our development repository (
One note: GTF files produced by Cuff* are unusual in that, for each assembled
transcript, they include a "transcript" element in addition to exons. This
element is problematic because it spans the entire transcript. Hence, in order
to get the sequence data for transcripts in a Cuff* GTF file, you'll want to
select for only exons (use Galaxy's 'Extract Features' tool) and then use the
resultant dataset as input to Extract Genomic DNA.
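For anyone who would rather script this outside Galaxy, here is a rough Python sketch of the same idea: keep only the exon rows, group them by transcript_id, and concatenate the exon subsequences. It is not the code behind Extract Genomic DNA; the function names (parse_exons, transcript_sequences) are made up for illustration. The genome string is also a placeholder, since the original sequence is not reproduced in this message: the exon positions carry the bases from Karen's example and every other position is filled with N.

```python
import re
from collections import defaultdict

# Used to reverse-complement transcripts on the minus strand.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def parse_exons(gtf_lines):
    """Group exon intervals by transcript_id, skipping 'transcript' rows."""
    exons = defaultdict(list)
    for line in gtf_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Real GTF is tab-delimited; fall back to whitespace for pasted text.
        fields = line.split("\t") if "\t" in line else line.split(None, 8)
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match is None:
            continue
        exons[match.group(1)].append(
            (fields[0], int(fields[3]), int(fields[4]), fields[6])
        )
    return exons


def transcript_sequences(gtf_lines, genome):
    """Return {transcript_id: spliced sequence} from exon rows and a genome dict."""
    sequences = {}
    for transcript_id, parts in parse_exons(gtf_lines).items():
        parts.sort(key=lambda part: part[1])  # order exons by start position
        # GTF coordinates are 1-based and inclusive, hence start - 1.
        seq = "".join(genome[chrom][start - 1:end] for chrom, start, end, _ in parts)
        if parts[0][3] == "-":
            seq = seq.translate(COMPLEMENT)[::-1]
        sequences[transcript_id] = seq
    return sequences


GTF_LINES = [
    'contig00035 Cufflinks transcript 3 22 1000 + . gene_id "CUFF.23955"; transcript_id "CUFF.23955.1";',
    'contig00035 Cufflinks exon 3 10 1000 + . gene_id "CUFF.23955"; transcript_id "CUFF.23955.1"; exon_number "1";',
    'contig00035 Cufflinks exon 13 18 1000 + . gene_id "CUFF.23955"; transcript_id "CUFF.23955.1"; exon_number "2";',
    'contig00035 Cufflinks exon 20 22 1000 + . gene_id "CUFF.23955"; transcript_id "CUFF.23955.1"; exon_number "3";',
]

# Placeholder genome: N marks positions not given in the example (1-2, 11-12, 19).
GENOME = {"contig00035": "NNAGCGTCTCNNACGCGGNTAT"}

result = transcript_sequences(GTF_LINES, GENOME)
print(result)
```

Because the "transcript" row is skipped, only the three exon intervals contribute to the output, which is the same effect as running 'Extract Features' first.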
Let us know if you have any questions.
On Jan 28, 2011, at 2:08 PM, Karen Tang wrote:
> I was thinking of something different. Here is an example of a three-exon
> transcript, in gtf format:
> contig00035 Cufflinks transcript 3 22 1000 + . gene_id "CUFF.23955"; transcript_id "CUFF.23955.1";
> contig00035 Cufflinks exon 3 10 1000 + . gene_id "CUFF.23955"; transcript_id "CUFF.23955.1"; exon_number "1";
> contig00035 Cufflinks exon 13 18 1000 + . gene_id "CUFF.23955"; transcript_id "CUFF.23955.1"; exon_number "2";
> contig00035 Cufflinks exon 20 22 1000 + . gene_id "CUFF.23955"; transcript_id "CUFF.23955.1"; exon_number "3";
> and the genome sequence that the transcript comes from is:
> I want the sequence for this transcript: I want to extract from the genome
> sequence the subsequences for positions 3-10, 13-18, and 20-22, and then
> concatenate the three subsequences to create the transcript sequence.
> In this case, it would be AGCGTCTC + ACGCGG + TAT, meaning the transcript
> sequence would be AGCGTCTCACGCGGTAT.
> Is it possible to do this in Galaxy?
> Karen :)
> On Thu, 27 Jan 2011, Jennifer Jackson wrote:
>> Hello Karen,
>> The following general workflow should help you to pull sequences from any
>> 1) cut out the sequence IDs from the query (in this case, a GTF & BED file)
>> and sort them.
>> Text Manipulation -> Cut columns from a table
>> Filter and Sort -> Sort
>> 2) convert the target fasta file to tabular format
>> Convert Formats -> FASTA-to-Tabular converter
>> 3) join the two datasets based on the sequence ID
>> Join, Subtract and Group -> Join two Queries
>> 4) convert to FASTA
>> Convert Formats -> Tabular-to-FASTA
>> 5) when starting with a GTF file, there will most likely be duplicates. To
>> remove, use:
>> NGS: QC and manipulation -> Collapse sequences
>> Once you create the actual workflow that performs the job, be sure to save
>> it so that you can just re-use it whenever you need to perform the same
>> task. To do this, from the history pane (far right), use Options -> Extract
>> workflow and follow the instructions on the form to customize it.
>> Hopefully this helps,
>> Galaxy team
>> On 1/26/11 12:05 PM, Karen Tang wrote:
>>> Hi Galaxy people,
>>> I have transcripts predicted by Cufflinks that are in a gtf file. How
>>> can I extract the sequences corresponding to those transcripts, using
>>> [Cufflinks transcript predictions in gtf file] + [Genome sequence in
>>> FASTA file] ---> [FASTA file of transcript sequences]
>>> My genome is a custom genome (not at UCSC).
>>> I'll also need to do the same thing, except my predicted transcripts are
>>> in a Scripture bed file.
>>> Thanks for your help!
>>> Karen Tang :)
>>> Plant Biology
>>> University of Minnesota
>>> galaxy-user mailing list