Hi Qian,

This is the source of the mismatch. In fasta format, any text between the leading ">" and the first whitespace (tab, space, etc) is considered the "identifier".
http://wiki.g2.bx.psu.edu/Support#Error_from_tools
http://wiki.g2.bx.psu.edu/Learn/Datatypes#Fasta

Which in your case is this:

   gi|289546492|ref|NC_011420.2|

The identifier must be changed to be only this:

   NC_011420.2

to be consistent with the GFF file. There are many ways to do this, one simple way is to use the tools in 'Text Manipulation'. Use 'Remove beginning of a file' to strip the first line. Then create a plain text file on your computer, one line, with just the identifier content like this:

   >NC_011420.2

Very important - do not make the file more than one line long - meaning do not add in an extra 'return' at the end of the first line. The plain text file just needs these characters - no more - including no whitespace characters (spaces, tabs, newlines). Upload this file to your instance as text and "Concatenate datasets" with the sequence file that had the first line removed, placing this identifier dataset on top. Then change the datatype back to ".fasta" on the final dataset, if necessary (click on the pencil icon).

Then re-run Bowtie to have the identifiers in your SAM file consistent with the GFF file.

After running Bowtie, be certain to sort the resulting dataset before running Cufflinks. Tophat produces sorted output, but Bowtie does not and Cufflinks requires sorted input.
http://main.g2.bx.psu.edu/u/jeremy/p/transcriptome-analysis-faq#faq2

Take care,

Jen
Galaxy team

On 9/26/12 12:54 PM, Qian Dong wrote:
Dear Jennifer,

My .fasta reference genome is like this:

'>gi|289546492|ref|NC_011420.2| Rhodospirillum centenum SW chromosome,
complete genome'

and in the SAM file generated by BOWTIE it says:


@SQ     SN:gi|289546492|ref|NC_011420.2|        LN:4355543

So I think they are all the same as "NC_011420.2". Is there anything else I can 
try?

Thank you,

Qian


On Wed, Sep 26, 2012 at 3:35 PM, Jennifer Jackson <j...@bx.psu.edu
<mailto:j...@bx.psu.edu>> wrote:

    Hello,

    The first thing to double check is that the chromosome identifier is
    an exact match between the reference genome and the reference
    annotation.

    The GFF3 file is naming the chromosome "NC_011420.2".

    The reference annotation chromosome should be named exactly the same
    way. Check this in the input BAM/SAM datasets or the original .fasta
    reference genome.

    Hopefully this finds the problem. Correcting mismatched names (due
    to various reasons) is the most common solution to this sort of issue:
    'Tools on the Main server: Example', bullet item #2:
    http://wiki.g2.bx.psu.edu/__Support#Interpreting___scientific_results 
<http://wiki.g2.bx.psu.edu/Support#Interpreting_scientific_results>

    Best,

    Jen
    Galaxy team



    On 9/26/12 8:46 AM, Qian Dong wrote:

        Dear Team,

        I've been having a problem with cufflink regarding GFF files. I
        tried
        searching the mailing list first and failed to find an answer.
        Could you
        help me look at this?

        I downloaded my genome annotation GFF file from NCBI (soon I
        realized
        NCBI format may be a problem) for my bacterial RNA-seq data
        analysis. My
        GFF file looks like the following:

        '
        ##gff-version 3
        #!gff-spec-version 1.20
        #!processor NCBI annotwriter
        ##sequence-region NC_011420.2 1 4355543 <tel:011420.2%201%204355543>
        ##species
        http://www.ncbi.nlm.nih.gov/__Taxonomy/Browser/wwwtax.cgi?__id=414684
        <http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=414684>
        NC_011420.2     RefSeq  region  1       4355543 .       +       .
        
ID=id0;Dbxref=taxon:414684;Is___circular=true;culture-__collection=ATCC:51521;gb-__synonym=Rhodocista
        centenaria SW;gbkey=Src;genome=__chromosome;mol_type=genomic
        DNA;strain=SW%3B ATCC 51521
        NC_011420.2     RefSeq  gene    11      3343    .       +       .
        
ID=gene0;Name=RC1_0011;Dbxref=__GeneID:7008893;gbkey=Gene;__locus_tag=RC1_0011
        NC_011420.2     RefSeq  CDS     11      3343    .       +       0
        ID=cds0;Name=YP_002296275.1;__Parent=gene0;Note=Contains a type I
        secretion target ggxgxdxxx repeat %282 copies%29 domain%3B
        Contains a
        Cadherin domain%3B identified by match to protein family HMM
        
PF02789;Dbxref=Genbank:YP___002296275.1,GeneID:7008893;__gbkey=CDS;product=hypothetical
        protein;protein_id=YP___002296275.1;transl_table=11


        I used this file for cufflink but all the FPKM values are 0.  I
        checked
        out this link: http://cufflinks.cbcb.umd.edu/__gff.html
        <http://cufflinks.cbcb.umd.edu/gff.html> and thought that
        maybe the problem is because I don't have any mRNA feature in my gff
        file. Since I am dealing with a bacterial genome, there is no
        exon/intron or UTR info needed. Therefore I modified my GFF file
        into
        the following:

        ##gff-version 3
        #!gff-spec-version 1.20
        #!processor NCBI annotwriter
        ##sequence-region NC_011420.2 1 4355543 <tel:011420.2%201%204355543>
        ##species
        http://www.ncbi.nlm.nih.gov/__Taxonomy/Browser/wwwtax.cgi?__id=414684
        <http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=414684>
        NC_011420.2     RefSeq  region  1       4355543 .       +       .
        
ID=id0;Dbxref=taxon:414684;Is___circular=true;culture-__collection=ATCC:51521;gb-__synonym=Rhodocista
        centenaria SW;gbkey=Src;genome=__chromosome;mol_type=genomic
        DNA;strain=SW%3B ATCC 51521
        NC_011420.2     RefSeq  mRNA    11      3343    .       +       .
        
ID=mRNA0;Name=RC1_0011;Dbxref=__GeneID:7008893;gbkey=Gene;__locus_tag=RC1_0011
        NC_011420.2     RefSeq  CDS     11      3343    .       +       0
        ID=cds0;Name=YP_002296275.1;__Parent=mRNA0;Note=Contains a type I
        secretion target ggxgxdxxx repeat %282 copies%29 domain%3B
        Contains a
        Cadherin domain%3B identified by match to protein family HMM
        
PF02789;Dbxref=Genbank:YP___002296275.1,GeneID:7008893;__gbkey=CDS;product=hypothetical
        protein;protein_id=YP___002296275.1;transl_table=11



        I re-ran cufflink however this time there is error reported. I
        can only
        tell from the report that there is a segmentation fault but not
        further
        details. The report is as follows:

        Error running cufflinks.
        return code = 139
        Command line:
        cufflinks -q --no-update-check -I 100 -F 0.100000 -j 0.150000 -p
        4 -G /galaxy/test_pool/pool5/files/__000/327/dataset_327777.dat
        /galaxy/test_database/files/__000/325/dataset_325086.dat
        [19:41:41] Loading reference annotation.
        Segmentation fault

        cp: cannot stat
        
`/galaxy/test_pool/pool3/tmp/__job_working_directory/000/170/__170197/global_model.txt':
        No such file or directory
        cp: cannot stat
        
`/galaxy/test_pool/pool3/tmp/__job_working_directory/000/170/__170197/isoforms.fpkm_tracking'__:
        No such file or directory
        cp: cannot stat
        
`/galaxy/test_pool/pool3/tmp/__job_working_directory/000/170/__170197/genes.fpkm_tracking':
        No such file or directory


        My questions will be:

        1. Is there any way to modify a NCBI bacterial genome annotation GFF
        file to make it usable for cufflink? Our genome annotation is only
        available in NCBI, not ensemble or USDC so this is pretty much
        my only
        choice..

        2. Should I proceed with modifying the GFF file or should I
        convert it
        into GTF and use the GTF instead in cufflink?

        I am a biochemist and really new to the computer world so any advice
        will help!

        Thanks a lot,

        Qian
        --
        Qian Dong
        Bauer Lab, MCBD
        Simon Hall: 313-317
        212 S. Hawthorne Dr.
        Bloomington, IN 47405
        Email:do...@indiana.edu <mailto:email%3ado...@indiana.edu>
        <mailto:Email%3Adong3@indiana.__edu
        <mailto:email%253ado...@indiana.edu>>
        Lab Phone:812-855-8443 <tel:812-855-8443>



        _____________________________________________________________
        The Galaxy User list should be used for the discussion of
        Galaxy analysis and other features on the public server
        at usegalaxy.org <http://usegalaxy.org>.  Please keep all
        replies on the list by
        using "reply all" in your mail client.  For discussion of
        local Galaxy instances and the Galaxy source code, please
        use the Galaxy Development list:

        http://lists.bx.psu.edu/__listinfo/galaxy-dev
        <http://lists.bx.psu.edu/listinfo/galaxy-dev>

        To manage your subscriptions to this and other Galaxy lists,
        please use the interface at:

        http://lists.bx.psu.edu/


    --
    Jennifer Jackson
    http://galaxyproject.org




--
Qian Dong
Bauer Lab, MCBD
Simon Hall: 313-317
212 S. Hawthorne Dr.
Bloomington, IN 47405
Email:do...@indiana.edu <mailto:email%3ado...@indiana.edu>
Lab Phone:812-855-8443


--
Jennifer Jackson
http://galaxyproject.org
___________________________________________________________
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org.  Please keep all replies on the list by
using "reply all" in your mail client.  For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:

 http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists,
please use the interface at:

 http://lists.bx.psu.edu/

Reply via email to