Re: [galaxy-user] Extract sequences from [gtf file] + [genome FASTA file]

Jeremy Goecks Thu, 10 Feb 2011 14:32:09 -0800

Hi Karen,

I just implemented this functionality in Galaxy's 'Extract Genomic DNA' tool. 
This functionality will be available on our main server in the next couple 
weeks and is available now via our development repository ( 
bitbucket.org/galaxy/galaxy-central/ )


One note: GTF files produced by Cuff* are unusual in that, for each assembled 
transcript, they include a "transcript" element in addition to exons. This 
element is problematic because it spans the entire transcript, introns 
included. Hence, in order to get the sequence data for transcripts in a Cuff* 
GTF file, you'll want to select for only exons (use Galaxy's 'Extract 
Features' tool) and then use the resultant dataset as input to 'Extract 
Genomic DNA'.
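
For anyone working outside Galaxy, the exon-only filtering step can be sketched 
in a few lines of Python. This is a minimal illustration, not the code behind 
'Extract Features'; the function name keep_exons is made up for the example, 
and the two records are copied from the example quoted below.

```python
def keep_exons(gtf_lines):
    """Yield only GTF records whose feature column (column 3) is 'exon'."""
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        # A GTF record has 9 tab-separated columns; the feature type is the third.
        if len(fields) >= 9 and fields[2] == "exon":
            yield line


example = [
    'contig00035\tCufflinks\ttranscript\t3\t22\t1000\t+\t.\t'
    'gene_id "CUFF.23955"; transcript_id "CUFF.23955.1";\n',
    'contig00035\tCufflinks\texon\t3\t10\t1000\t+\t.\t'
    'gene_id "CUFF.23955"; transcript_id "CUFF.23955.1"; exon_number "1";\n',
]
exons_only = list(keep_exons(example))  # the "transcript" record is dropped
```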

Let us know if you have any questions.

Thanks,
J.



On Jan 28, 2011, at 2:08 PM, Karen Tang wrote:

> I was thinking of something different.  Here is an example of a three-exon 
> transcript, in gtf format:
> 
> contig00035   Cufflinks       transcript      3       22      1000    +       .       gene_id "CUFF.23955"; transcript_id "CUFF.23955.1";
> contig00035   Cufflinks       exon    3       10      1000    +       .       gene_id "CUFF.23955"; transcript_id "CUFF.23955.1"; exon_number "1";
> contig00035   Cufflinks       exon    13      18      1000    +       .       gene_id "CUFF.23955"; transcript_id "CUFF.23955.1"; exon_number "2";
> contig00035   Cufflinks       exon    20      22      1000    +       .       gene_id "CUFF.23955"; transcript_id "CUFF.23955.1"; exon_number "3";
> 
> and the genome sequence that the transcript comes from is:
> 
> >contig00035
> GTAGCGTCTCCGACGCGGATATGACCGCACGCTGATGCTCCCAGGGATGAGAGGCGTGCG
> 
> I want the sequence for this transcript: I want to extract from the genome 
> sequence the subsequences for positions 3-10, 13-18, and 20-22, and then 
> concatenate the three subsequences to create the transcript sequence.
> 
> In this case, it would be AGCGTCTC + ACGCGG + TAT, meaning the transcript 
> sequence would be AGCGTCTCACGCGGTAT.
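
The extraction and concatenation described above can be sketched in Python as 
follows. This is only an illustration under stated assumptions, not how 
Galaxy's tool is implemented: the function name transcript_sequences is 
hypothetical, exons are grouped by the transcript_id attribute and joined in 
order of start position, and minus-strand transcripts are reverse-complemented 
(an assumption about the desired output). GTF coordinates are 1-based and 
inclusive, hence the start - 1 in the slice.

```python
import re
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def transcript_sequences(gtf_lines, genome):
    """Return {transcript_id: spliced sequence} built from exon records only."""
    exons = defaultdict(list)
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # ignore "transcript" records and malformed lines
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match is None:
            continue
        exons[match.group(1)].append(
            (fields[0], int(fields[3]), int(fields[4]), fields[6])
        )

    sequences = {}
    for transcript_id, parts in exons.items():
        parts.sort(key=lambda part: part[1])  # order exons by start position
        # 1-based inclusive [start, end] maps to the Python slice [start-1:end].
        seq = "".join(genome[contig][start - 1:end]
                      for contig, start, end, _strand in parts)
        if parts[0][3] == "-":
            # Assumption: report minus-strand transcripts as reverse complement.
            seq = seq.translate(COMPLEMENT)[::-1]
        sequences[transcript_id] = seq
    return sequences


genome = {
    "contig00035":
        "GTAGCGTCTCCGACGCGGATATGACCGCACGCTGATGCTCCCAGGGATGAGAGGCGTGCG",
}
attrs = 'gene_id "CUFF.23955"; transcript_id "CUFF.23955.1";'
gtf = [
    f"contig00035\tCufflinks\ttranscript\t3\t22\t1000\t+\t.\t{attrs}\n",
    f'contig00035\tCufflinks\texon\t3\t10\t1000\t+\t.\t{attrs} exon_number "1";\n',
    f'contig00035\tCufflinks\texon\t13\t18\t1000\t+\t.\t{attrs} exon_number "2";\n',
    f'contig00035\tCufflinks\texon\t20\t22\t1000\t+\t.\t{attrs} exon_number "3";\n',
]
result = transcript_sequences(gtf, genome)
print(result)  # {'CUFF.23955.1': 'AGCGTCTCACGCGGTAT'}
```

On the three-exon example this reproduces AGCGTCTC + ACGCGG + TAT.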
> 
> Is it possible to do this in Galaxy?
> 
> Karen :)
> 
> On Thu, 27 Jan 2011, Jennifer Jackson wrote:
> 
>> Hello Karen,
>> 
>> The following general workflow should help you to pull sequences from any 
>> source.
>> 
>> 1) cut out the sequence IDs from the query (in this case, a GTF & BED file) 
>> and sort them.
>> Text Manipulation -> Cut columns from a table
>> Filter and Sort -> Sort
>> 2) convert the target fasta file to tabular format
>> Convert Formats ->  FASTA-to-Tabular converter
>> 3) join the two datasets based on the sequence ID
>> Join, Subtract and Group -> Join two Queries
>> 4) convert to FASTA
>> Convert Formats -> Tabular-to-FASTA
>> 5) when starting with a GTF file, there will most likely be duplicates. To 
>> remove, use:
>> NGS: QC and manipulation -> Collapse sequences
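
The five steps above amount to a join on sequence ID, which can be sketched in 
Python roughly as below. The function names are hypothetical and this is not 
what the Galaxy tools run internally. One deliberate substitution: duplicates 
are removed by de-duplicating the IDs before the join, rather than by 
collapsing identical sequences afterwards. Note that this returns each matching 
FASTA record whole, not sub-ranges of it.

```python
def fasta_to_table(fasta_text):
    """Step 2: turn FASTA text into {sequence_id: sequence}."""
    table = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""  # ID is the first word of the header
            table[name] = ""
        elif name is not None:
            table[name] += line  # sequences may be wrapped over several lines
    return table


def pull_sequences(query_lines, fasta_text, id_column=0):
    """Steps 1 and 3-5: cut the ID column, de-duplicate, sort, join, emit FASTA."""
    ids = sorted({line.rstrip("\n").split("\t")[id_column]
                  for line in query_lines if line.strip()})
    table = fasta_to_table(fasta_text)
    return "".join(f">{seq_id}\n{table[seq_id]}\n"
                   for seq_id in ids if seq_id in table)


query = [
    "contig00035\tCufflinks\texon\t3\t10\n",
    "contig00035\tCufflinks\texon\t13\t18\n",
]
fasta = ">contig00035\nGTAG\nCGTC\n>other\nAAAA\n"
joined = pull_sequences(query, fasta)
```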
>> 
>> Once you create the actual workflow that performs the job, be sure to save 
>> it so that you can just re-use it whenever you need to perform the same 
>> task. To do this, from the history pane (far right) use Options -> Extract 
>> workflow and follow the instructions on the form to customize.
>> 
>> Hopefully this helps,
>> 
>> Jen
>> Galaxy team
>> 
>> On 1/26/11 12:05 PM, Karen Tang wrote:
>>> Hi Galaxy people,
>>> I have transcripts predicted by Cufflinks that are in a gtf file. How
>>> can I extract the sequences corresponding to those transcripts, using
>>> Galaxy?
>>> [Cufflinks transcript predictions in gtf file] + [Genome sequence in
>>> FASTA file] ---> [FASTA file of transcript sequences]
>>> My genome is a custom genome (not at UCSC).
>>> ---------
>>> I'll also need to do the same thing, except my predicted transcripts are
>>> in a Scripture bed file.
>>> Thanks for your help!
>>> Karen Tang :)
>>> Plant Biology
>>> University of Minnesota
>>> _______________________________________________
>>> galaxy-user mailing list
>>> [email protected]
>>> http://lists.bx.psu.edu/listinfo/galaxy-user
>> 
>> 


