Hi all,

 

The folks on the FlyBase help mailing list were able to answer my questions
about the persimilis and willistoni scaffold naming discrepancies between
the UCSC Genome Browser and FlyBase:

 

1. D. persimilis:

There is a one-to-one relationship between the FlyBase "scaffold_"-prefixed
headers and the UCSC "super_"-prefixed headers. For example, scaffold_14
corresponds to super_14.
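
Because the persimilis relationship is a plain prefix swap, a GFF file whose
first column uses scaffold_N names can be renamed line by line. The following
is only a minimal sketch: the function name is hypothetical, and it assumes a
tab-separated GFF in which only the first column needs changing (directives
such as ##sequence-region are left untouched and may need separate handling).

```python
def scaffold_to_super(gff_line: str) -> str:
    """Rewrite the seqid (first) column of one GFF line from scaffold_N to super_N."""
    # Leave blank lines, comments and ## directives unchanged.
    if not gff_line.strip() or gff_line.startswith("#"):
        return gff_line
    fields = gff_line.split("\t")
    if fields[0].startswith("scaffold_"):
        # Keep the numeric suffix, swap only the prefix.
        fields[0] = "super_" + fields[0][len("scaffold_"):]
    return "\t".join(fields)


# Illustrative record; the gene ID here is made up.
example = "scaffold_14\tFlyBase\tgene\t100\t200\t.\t+\t.\tID=example_gene\n"
print(scaffold_to_super(example), end="")
```

Applying the function to every line of a GFF file and writing the result to a
new file should give something that can be uploaded as a custom track, provided
the coordinates themselves are unchanged between the two naming schemes.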

 

2. D. willistoni:

The mapping is not as trivial as it is for persimilis. Both the UCSC Genome
Browser and FlyBase use the AAA CAF1 assemblies produced in 2007. However,
89 scaffolds in the D. willistoni CAF1 assembly mapped to Wolbachia and were
suppressed before the assembly was added to GenBank. The scaffold IDs were
also changed to avoid confusion with the previous (original) assembly. In
short, it appears that the UCSC Genome Browser is using the original,
unsuppressed willistoni assembly, while FlyBase is using the modified
assembly.

 

A mapping between the original AAA version and what exists in
GenBank/FlyBase can be found here:

ftp://ftp.ncbi.nih.gov/genomes/Drosophila_melanogaster/special_requests/CAF1/dwil/dwil_scaffold2GenBank
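
For willistoni, the renaming has to go through a lookup table rather than a
prefix swap. The sketch below is hedged: I am ASSUMING the mapping file has
one scaffold per line with the two names in separate whitespace-delimited
columns, which should be checked against the real file before use. The
function names and the scaffold names in the example are hypothetical.

```python
from typing import Dict, Iterable, Iterator


def load_mapping(lines: Iterable[str], key_col: int = 0,
                 value_col: int = 1) -> Dict[str, str]:
    """Build a name-to-name dict from whitespace-separated columns.

    ASSUMPTION: one scaffold per line, names in separate columns.
    Adjust key_col/value_col after inspecting the real mapping file.
    """
    mapping: Dict[str, str] = {}
    for line in lines:
        parts = line.split()
        # Skip blank lines, comments, and lines too short to hold both columns.
        if not parts or line.startswith("#") or len(parts) <= max(key_col, value_col):
            continue
        mapping[parts[key_col]] = parts[value_col]
    return mapping


def rename_seqids(gff_lines: Iterable[str],
                  mapping: Dict[str, str]) -> Iterator[str]:
    """Yield GFF lines with the first column translated through mapping."""
    for line in gff_lines:
        if line.strip() and not line.startswith("#"):
            fields = line.split("\t")
            # Names missing from the mapping are left unchanged.
            fields[0] = mapping.get(fields[0], fields[0])
            line = "\t".join(fields)
        yield line


# Made-up names, for illustration only.
table = load_mapping(["scaffold_1 scf2_0001\n"], key_col=1, value_col=0)
for out in rename_seqids(["scf2_0001\tFlyBase\tgene\t1\t9\t.\t+\t.\tID=x\n"], table):
    print(out, end="")
```

Swapping key_col and value_col reverses the direction of the translation, so
the same two functions can be used to go from the original names to the
GenBank/FlyBase names or back.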

 

A full list of problematic scaffolds that were found during the GenBank
submission process (for all 12 Drosophila genomes) can be found here:

ftp://ftp.ncbi.nih.gov/genomes/Drosophila_melanogaster/special_requests/CAF1/foreign_scaffolds_in_caf1.txt

This explanation has been reworded from the response I received on the
FlyBase help mailing list. I thank everyone for their generous help with
this ambiguity. Hopefully others who encounter this discrepancy will find a
suitable answer in this thread.

 

Thanks,

Jaaved

 

--
Jaaved Mohammed,
Ph.D. Student of Computational Biology 
Tri-Institutional Training Program in Computational Biology and Medicine 
(Cornell University - Ithaca, Weill Cornell Medical College, and Memorial
Sloan-Kettering Cancer Center)

 

 

From: Greg Roe [mailto:[email protected]] 
Sent: Tuesday, September 27, 2011 8:06 PM
To: Jaaved Mohammed
Cc: [email protected]
Subject: Re: [Genome] super vs scaffold coordinates & D. willistoni on the
browser.

 

Hi Jaaved,

We ran faCount, so no need to do that yourself:

http://www.broadinstitute.org/ftp/pub/assemblies/insects/droSec1/assembly.bases.gz
21,424 contigs (UCSC: 14,730 super contigs)
http://www.broadinstitute.org/ftp/pub/assemblies/insects/droPer1/assembly.bases.gz
26,812 contigs (UCSC: 12,838 super contigs)

As stated before, the assemblies hosted at UCSC have not been updated for
quite some time. Obviously a lot of work has been done on these organisms
since. You would have to track down which labs produced the newer data in
order to answer your questions. We just don't have that information.

Please let us know if you have any additional questions: [email protected]

-
Greg Roe
UCSC Genome Bioinformatics Group 


On 9/19/11 6:53 AM, Jaaved Mohammed wrote: 

Hi Vanessa,
 
Thanks for your response. Can you help point me to the download site
with the latest assembly for any of the three fly species that the
engineer refers to?
 
I can find multiple nucleotide sequences across several sites. For
example, for D. willistoni, I can find an assembly at LBNL
(http://rana.lbl.gov/drosophila/assemblies.html), and from NCBI, I can
download all the raw sequences from
http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=7260. I
could not find any of the insects on the NCBI Ensembl genome browser
either. Can you help point me in the right direction?
 
Thanks,
Jaaved
 
 
 
On Fri, Sep 16, 2011 at 3:06 PM, Vanessa Kirkup Swing
<[email protected]> wrote:

Hi Jaaved,
 
To answer your first question:
 
The genomes we have are old so it is possible that the differences may
be due to years of version updates.
 
One of our engineers has this to say:
Go to the current download site for these genomes, fetch the sequence
file, and run faCount on it to see what they name the bits. Compare names
and genome organization with what we display. I would assume that after 5
or 6 years, these genomes most likely have new assemblies. These genome
project sites would most likely explain their update history. You may
also find assembly history in the browsers at Ensembl. There may also be
information on their trace archive pages if they have them. For example:
http://www.ncbi.nlm.nih.gov/Traces/wgs/?val=AAMC01
 
To answer your second question:
 
Unfortunately, our funding covers primarily vertebrate genomes, though
we do host a few of the major model organisms.
 
Hope this helps. If you have further questions, please contact the
mailing list: [email protected].
 
Vanessa Kirkup Swing
UCSC Genome Bioinformatics Group
 
 
---------- Forwarded message ----------
From: Jaaved Mohammed <[email protected]>
Date: Thu, Sep 15, 2011 at 8:57 AM
Subject: [Genome] super vs scaffold coordinates & D. willistoni on the
browser.
To: [email protected]
 
 
Hello,
 
I have two questions that I would really appreciate your help with
answering.
 
===========
Firstly,
===========
 
I am trying to understand the origin of the "super*" coordinates for the
droPer1 and droSec1 genomes available on the UCSC Genome Browser.
 
For example, in the D. sechellia assembly, I see that all the chromosomes
are prefixed by "super" on the Genome Browser:
http://genome-mirror.bscb.cornell.edu/cgi-bin/hgTracks?hgsid=36382&chromInfoPage=
However, on Flybase.org, the GFF files, and indeed any coordinates on
FlyBase, are always prefixed by "scaffold", as can be seen at
ftp://flybase.net/genomes/Drosophila_sechellia/current/gff/.
Why is this? How was the conversion from "scaffold" to "super" coordinates
done? I'm trying to convert the FlyBase genes reported in the GFF files
into a file that I can upload to the browser to see the FlyBase-annotated
genes, non-coding RNAs, etc. However, this clash of coordinate names is
causing many problems.
 
I should note that I looked in all the older revisions of the Flybase GFF
files and still I see no "super" prefixed coordinates. I hope I'm not
looking at the wrong flybase GFF files.
 
The same observation was made in the droPer1 reference assembly.
 
===========
Secondly,
===========
 
I've noticed that the D. willistoni reference assembly is not
available on the UCSC Genome Browser. Why is this?
 
I've added this genome to the Cornell mirror using the droWil1.fa file
downloaded/available from the UCSC browser. The added genome can be viewed
here:
http://genome-mirror.bscb.cornell.edu/cgi-bin/hgGateway?hgsid=36387&clade=insect&org=D.+willistoni&db=0
 
On a similar note to the first point above, I've observed that the
coordinates are prefixed with "scaffold" on the browser, but flybase reports
coordinates prefixed with "scf2_":
ftp://flybase.net/genomes/Drosophila_willistoni/current/gff/.
 
 
Thanks,
Jaaved
 
 
--
Jaaved Mohammed,
Ph.D. Student of Computational Biology
Tri-Institutional Training Program in Computational Biology and Medicine
(Cornell University - Ithaca, Weill Cornell Medical College, and Memorial
Sloan-Kettering Cancer Center)
 
 
 
 
 
_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
 

 