Hi Lindsay,

Nice to hear that Vipin's server worked out for you.

A reduction in the number of lines is completely expected. If you want to double-check, the BED12 file should have one line per transcript in the GFF3 file (in your data these are the features typed "mRNA", each with its own unique "ID"). This is the same as the number of unique transcripts in any of the files, if summed up correctly.
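
A minimal sketch of that double check in Python. It assumes transcripts are the features typed "mRNA" in column 3 (true of the sample in this thread; other GFF3 files may use other types), and the function names are made up for illustration:

```python
def count_transcripts(gff3_lines, transcript_types=("mRNA",)):
    """Count GFF3 features whose type (column 3) marks a transcript."""
    count = 0
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 9 and fields[2] in transcript_types:
            count += 1
    return count


def count_bed_lines(bed_lines):
    """Count data lines in a BED file (ignores blanks and header lines)."""
    return sum(
        1
        for line in bed_lines
        if line.strip() and not line.startswith(("#", "track", "browser"))
    )
```

Called on the open .gff3 and .bed files, the two functions should return the same number if the converter wrote one line per transcript.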

Why is this? A GFF3 file has one line per feature: a gene, a transcript, an exon, a piece of CDS, and so on, with further annotation (this can vary) that tells you whether it is 5' UTR, CDS, etc. When converting GFF3 -> BED12, all the features for a particular transcript are rolled up into a single line of output. By the GFF3 specification, each transcript has a unique "ID", and every feature that belongs to that transcript refers to it in the attributes field of the GFF3 file. In your file that reference is the "Parent" attribute (for example "Parent=2" on the exon and CDS lines), and it is usually among the first attributes listed.

When you converted just GFF3 -> BED6, each feature was individually converted into a distinct line of BED formatted output; there was no summing up to group the information for the transcript as a whole.
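
The roll-up can be sketched in Python as below. This is only an illustration under stated assumptions, not the code of the tool Vipin hosts: it assumes tab-separated input, transcripts typed "mRNA", exon and CDS children linked through "Parent", and non-overlapping exons. All function names are hypothetical.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Turn 'key=value;key=value' (GFF3 column 9) into a dict."""
    pairs = (item.split("=", 1) for item in column9.strip().split(";") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


def gff3_to_bed12(gff3_lines):
    """Roll exon/CDS features up into one BED12 line per mRNA."""
    transcripts = {}            # transcript ID -> (chrom, strand, name)
    exons = defaultdict(list)   # transcript ID -> [(start0, end), ...]
    cds = defaultdict(list)     # transcript ID -> [(start0, end), ...]
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        chrom, _source, ftype, start, end, _score, strand, _phase, attr_text = fields[:9]
        attrs = parse_attributes(attr_text)
        start0, end = int(start) - 1, int(end)  # GFF3 is 1-based, BED start is 0-based
        if ftype == "mRNA":
            transcripts[attrs["ID"]] = (chrom, strand, attrs.get("Name", attrs["ID"]))
        elif ftype in ("exon", "CDS"):
            target = exons if ftype == "exon" else cds
            for parent in attrs.get("Parent", "").split(","):
                if parent:
                    target[parent].append((start0, end))

    bed_lines = []
    for tid, (chrom, strand, name) in transcripts.items():
        blocks = sorted(exons.get(tid, []))
        if not blocks:
            continue  # nothing to roll up for this transcript
        tx_start = blocks[0][0]
        tx_end = max(block_end for _, block_end in blocks)
        coding = cds.get(tid)
        # Non-coding convention: thickStart == thickEnd == chromStart.
        thick_start = min(s for s, _ in coding) if coding else tx_start
        thick_end = max(e for _, e in coding) if coding else tx_start
        sizes = ",".join(str(e - s) for s, e in blocks)
        starts = ",".join(str(s - tx_start) for s, _ in blocks)
        columns = [chrom, tx_start, tx_end, name, 0, strand,
                   thick_start, thick_end, 0, len(blocks), sizes, starts]
        bed_lines.append("\t".join(str(c) for c in columns))
    return bed_lines
```

The sketch takes the transcript span from its exons rather than from the mRNA line, so that the first block starts at chromStart and the last block ends at chromEnd, as BED12 expects.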

The Galaxy wiki has more - GFF3 can represent other kinds of data, and that is covered in the link-outs to the original spec, but this is the gist of the meaning for your specific data.

Glad I could help before, and that Vipin had the tool online & available to simplify things!

Best,

Jen
Galaxy team

On 11/12/13 4:12 PM, Lindsay Rutter wrote:
Hello Jennifer:

Thank you for your very helpful and detailed explanation!

I ended up using the site that Vipin provided in his message (https://galaxy.cbio.mskcc.org/tool_runner?tool_id=fml_gff2bed).

Indeed, it produced a file with 12 columns that appears to be in a reasonable bed format (including the 10th column being an integer, with the 11th and 12th column consisting of that integer number of items separated by commas).
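
That column check can be automated with a short Python sketch. It assumes whitespace-separated fields with no spaces inside any field, and it treats a line with no blocks as a problem; the function name is hypothetical:

```python
def check_bed12_line(line):
    """Return a list of problems found in one BED12 line (empty if none)."""
    fields = line.split()
    if len(fields) != 12:
        return ["expected 12 columns, found %d" % len(fields)]
    problems = []
    chrom_start, chrom_end = int(fields[1]), int(fields[2])
    block_count = int(fields[9])
    # A trailing comma after the last item is tolerated.
    sizes = [int(x) for x in fields[10].split(",") if x]
    starts = [int(x) for x in fields[11].split(",") if x]
    if block_count < 1:
        problems.append("blockCount should be at least 1")
    if len(sizes) != block_count:
        problems.append("blockSizes has %d items, blockCount is %d" % (len(sizes), block_count))
    if len(starts) != block_count:
        problems.append("blockStarts has %d items, blockCount is %d" % (len(starts), block_count))
    if starts and starts[0] != 0:
        problems.append("first blockStart should be 0")
    if sizes and len(sizes) == len(starts):
        if chrom_start + starts[-1] + sizes[-1] != chrom_end:
            problems.append("last block does not end at chromEnd")
    return problems
```

An empty list means the line passed these particular checks; it does not prove the annotation itself is correct.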

However, I noticed that the number of lines in the file went from 183,748 (in the .gff3 file) to 11,506 (in the 12-column .bed file).

In your opinion, does this seem like a reasonable reduction in number of lines? I was not expecting that (in fact, I was expecting the number of lines would stay the same, as they did going from .gff3 to 6-column .bed file), but I am quite inexperienced using these files.

Thanking you...
Lindsay




On Mon, Nov 11, 2013 at 11:15 AM, Jennifer Jackson <j...@bx.psu.edu> wrote:

    Hello,

    There are no tools directly on the public Galaxy site to transform
    a GFF3 dataset into a BED12 dataset. However, the Tool Shed has a
    repository called 'fml_gff3togtf' that includes a tool for this
    purpose, for use in a local install. The description is a bit
    bothersome in that it has a slightly incorrect datatype statement,
    so be sure to test out the results. (The word "wiggle" has no
    place in this statement: "gff3_to_bed_converter.py: This tool
    converts gene transcript annotation from GFF3 format to UCSC
    wiggle 12 column BED format.")
    http://getgalaxy.org
    http://usegalaxy.org/toolshed

    I see your post at Biostar, and it might be helpful to let you
    know what a BED12 file represents (plus I'll post this there, may
    help others):
    http://www.biostars.org/p/85869/

    A BED12 file describes the complete, often spliced, alignment of a
    sequence to a reference genome. This does not include minor base
    variation, it is a macro alignment. You can think of each of the
    blocks as being "exons", although there is no magic here - if the
    sequence or genome had quality problems, or significant variation
    (large insertion or deletion), that could cause the alignment to
    fragment as well.
    Here is the data description:
    http://wiki.galaxyproject.org/Learn/Datatypes#Bed
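
To make the "blocks" concrete, here is a small Python sketch that expands one BED12 line back into the genomic interval of each block. It assumes whitespace-separated fields, and the function name is made up for illustration:

```python
def bed12_blocks(line):
    """Return (chrom, start, end) for every block of one BED12 line.

    Coordinates follow the BED convention: 0-based start, exclusive end.
    """
    fields = line.split()
    chrom, chrom_start = fields[0], int(fields[1])
    # blockSizes (column 11) and blockStarts (column 12) are comma-separated;
    # blockStarts are offsets relative to chromStart.
    sizes = [int(x) for x in fields[10].split(",") if x]
    starts = [int(x) for x in fields[11].split(",") if x]
    return [
        (chrom, chrom_start + offset, chrom_start + offset + size)
        for offset, size in zip(starts, sizes)
    ]
```

For a transcript with blockSizes "86,78,49" and blockStarts "0,208,702" starting at position 14, this gives the intervals 14-100, 222-300 and 716-765.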

    To see examples at UCSC (genome.ucsc.edu), an EST or mRNA track
    will have this as the primary table format. Gene tracks can also
    be in BED12 format, or in a related one, genePred:
    http://genome.ucsc.edu/FAQ/FAQformat.html#format9

    UCSC also has command-line utilities to convert between the
    formats; pre-compiled versions are here:
    http://hgdownload.cse.ucsc.edu/downloads.html#source_downloads

    Either way, you can convert the data, then load it into the public
    Galaxy (usegalaxy.org) and proceed with your analysis. BEDTools
    works well with BED12 files. There is definitely information loss
    when attempting to transform BED6 -> BED12, as the global
    alignment is lost. Adjusting attributes such as score or name is
    often a matter of preference, so you can alter these however you
    want, as long as the formatting rules for the columns are
    followed.

    Hopefully this helps,

    Jen
    Galaxy team


    On 11/9/13 3:29 PM, lrutter @iastate.edu wrote:
    Hello Galaxy:

    Overall, I am trying to convert a .gff3 file to a 12-column .bed file.

    I first tried the GFF-to-BED converter, but it gave a 6-column .bed file.

    Then, I tried the BED-to-bigBed converter, inputting the 6-column
    .bed file. I get an error "Unspecified genome build, click the
    pencil icon in the history item to set the genome build".

    So, I click the pencil icon, and see 4 tabs at the top. I set the
    "Attributes" tab as in the attached image (Attributes.png).

    But then, when I select "Convert Format", I am only seeing an
    option that outputs .bed12 file as "Convert Genomic Intervals to
    Strict BED12". I am a bit confused about this because I specified
    the input file as a .bed file (and not genomic intervals, unless
    I am misunderstanding something).

    In any case, when I select "Convert Genomic Intervals to Strict
    BED12", I do get a .bed file with 12 columns. But I would like to
    ask if I may have lost information going from the .gff3 to
    .bed(6) to .bed(12)?

    (I feel that scores were all set to "0" from .gff3 to .bed(6),
    and columns 10, 11, 12 (block counts, sizes, and starting
    positions) were all set to zero going from .bed(6) to .bed(12)).

    If I am correct that there is information loss, is there a system
    in Galaxy to prevent this, and transfer as much information as
    possible from .gff3 to .bed(12)?

    Thank you.
    L. Rutter

    ** Below is a head of my three files (the species is P. dominula):

    .gff3 file

    ##gff-version 3
    ##date Mon Nov  4 14:54:42 2013
    ##source gbrowse gbgff gff3 dumper
    PdomScaf0001    maker   gene    15      1963    .       -       .       Name=PdomGene00025;ID=1;Dbxref=MAKER:maker-PdomScaf0001-snap-gene-0.274
    PdomScaf0001    maker   mRNA    15      1963    .       -       .       Name=PdomMRNA00025.1;Parent=1;ID=2;_QI=216%7C0%7C0.2%7C0.6%7C0.5%7C0.6%7C5%7C0%7C98;_eAED=0.43;_AED=0.43;Dbxref=MAKER:maker-PdomScaf0001-snap-gene-0.274-mRNA-1
    PdomScaf0001    maker   exon    15      100     -0.094  -       .       Parent=2;ID=3
    PdomScaf0001    maker   CDS     15      100     .       -       2       Parent=2;ID=4
    PdomScaf0001    maker   exon    223     300     21.8    -       .       Parent=2;ID=5
    PdomScaf0001    maker   CDS     223     300     .       -       2       Parent=2;ID=6
    PdomScaf0001    maker   exon    717     765     22.4    -       .       Parent=2;ID=7

    .bed(6) file

    PdomScaf0001    14      1963    gene    0 -
    PdomScaf0001    14      1963    mRNA    0 -
    PdomScaf0001    14      100     exon    0 -
    PdomScaf0001    14      100     CDS     0 -
    PdomScaf0001    222     300     exon    0 -
    PdomScaf0001    222     300     CDS     0 -
    PdomScaf0001    716     765     exon    0 -
    PdomScaf0001    716     765     CDS     0 -
    PdomScaf0001    906     947     exon    0 -
    PdomScaf0001    906     947     CDS     0 -

    .bed(12) file

    PdomScaf0001    14      1963    gene    0 -     14      1963    0       0       ,       ,
    PdomScaf0001    14      1963    mRNA    0 -     14      1963    0       0       ,       ,
    PdomScaf0001    14      100     exon    0 -     14      100     0       0       ,       ,
    PdomScaf0001    14      100     CDS     0 -     14      100     0       0       ,       ,
    PdomScaf0001    222     300     exon    0 -     222     300     0       0       ,       ,
    PdomScaf0001    222     300     CDS     0 -     222     300     0       0       ,       ,
    PdomScaf0001    716     765     exon    0 -     716     765     0       0       ,       ,
    PdomScaf0001    716     765     CDS     0 -     716     765     0       0       ,       ,
    PdomScaf0001    906     947     exon    0 -     906     947     0       0       ,       ,
    PdomScaf0001    906     947     CDS     0 -     906     947     0       0       ,       ,



    ___________________________________________________________
    Please keep all replies on the list by using "reply all"
    in your mail client.  To manage your subscriptions to this
    and other Galaxy lists, please use the interface at:
       http://lists.bx.psu.edu/

    To search Galaxy mailing lists use the unified search at:
       http://galaxyproject.org/search/mailinglists/

-- Jennifer Hillman-Jackson
    http://galaxyproject.org



