On Fri, 3 Dec 2010, Maximilian Haussler wrote:
> Very interesting thread!
>
> Bogdan, if you want to combine the data from the two URLs that Ewan sent
> you, be aware that UCSC is at Version 59 of Ensembl and the Biomart link
> points to version 60 of Biomart, so if Ensembl has changed anything from
> version 59 to version 60 for the human assembly (don't know how to find this
> info on the web at the moment), then you might want to use the Version 59
> Biomart at
> http://aug2010.archive.ensembl.org/biomart/martview/
>
> You just select the checkboxes Attributes / Biotype, Chrom, Start, End and
> click on output to get the lincRNA coordinates.
>
It's always best to stay synchronised on the same release :)
Human does tend to change a little with each release because of updates
from Havana moving in (though not necessarily every release).
One way to track this is the database extension name, which changes
whenever the database contents change. It is given as
<<global_release>>.<<species_specific>>; the species-specific part is
usually assemblynumber<<letter>>, where the letter advances each time
the content changes while the assembly stays the same:
release 60: 60.37e
release 59: 59.37d
(so - as 37e != 37d, there has been some content change)
You can get this from the assembly and stats table at:
http://www.ensembl.org/Homo_sapiens/Info/StatsTable?db=core
and on the archive site for release 59 (each page in Ensembl links to
its archived versions at the bottom of the page):
http://aug2010.archive.ensembl.org/Homo_sapiens/Info/StatsTable?db=core
There is actually finer granularity too, recording whether the content
change was just Xrefs or also the gene build... but I can't spot where
that is shown.
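As a rough sketch of that comparison (an illustration only, assuming the
name always looks like digits, a dot, digits, then an optional lower-case
letter; the function and class names here are made up for the example):

```python
import re
from typing import NamedTuple


class DbVersion(NamedTuple):
    release: int    # global Ensembl release, e.g. 60
    assembly: int   # assembly number, e.g. 37
    revision: str   # letter that advances on content change, e.g. "e"


def parse_db_version(name: str) -> DbVersion:
    """Split a name such as '60.37e' into release, assembly and letter."""
    match = re.fullmatch(r"(\d+)\.(\d+)([a-z]*)", name.strip())
    if match is None:
        raise ValueError(f"unrecognised database version: {name!r}")
    release, assembly, revision = match.groups()
    return DbVersion(int(release), int(assembly), revision)


def content_changed(old: str, new: str) -> bool:
    """True if the species-specific part differs between two releases."""
    a, b = parse_db_version(old), parse_db_version(new)
    return (a.assembly, a.revision) != (b.assembly, b.revision)


print(content_changed("59.37d", "60.37e"))  # prints True
```

Only the species-specific part is compared, since the global release
number changes on every release whether or not human was touched.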
> Note that the coordinates from Ensembl and UCSC are not completely
> compatible: You will need to remove all features on chromosome HSCHR6_* or
> on chromosome "LRG" (grep -v), prefix all chromosome numbers with "chr"
> (Excel, gawk, perl) and reorder the columns to get them into GFF or BED
> format.
>
We really must make this easier in the future. So silly to have these
issues. Something for a deeper conversation than this.
If you filter on the lincRNA biotype, you automatically don't get
LRGs (arguably LRGs should not be coming out in BioMart, but arguably
they should... hmmm....)
I think there are haplotypes other than HSCHR6_*, right? There is one
on chromosome 17, I think, so I am not sure that grep catches them all.
Use grep -v HSCHR instead, I think.
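The clean-up steps Max lists can be sketched in a few lines of Python.
This is only an illustration: the column headers below are assumptions
about what the BioMart export calls them, and the haplotype and LRG rows
in the sample are made up. The first sample row uses the coordinates from
the example links in this thread, which also show why the start shifts by
one (Ensembl is 1-based, BED is 0-based half-open).

```python
import csv
import io
from typing import Iterator, TextIO, Tuple

# Assumed column headers; check them against the header line of your export.
COL_BIOTYPE = "Gene Biotype"
COL_CHROM = "Chromosome Name"
COL_START = "Gene Start (bp)"
COL_END = "Gene End (bp)"


def biomart_to_bed(tsv: TextIO) -> Iterator[Tuple[str, int, int, str]]:
    """Yield (chrom, start, end, name) BED records from a BioMart TSV export."""
    for row in csv.DictReader(tsv, delimiter="\t"):
        chrom = row[COL_CHROM]
        # Roughly `grep -v HSCHR`, plus dropping LRG entries.
        if chrom.startswith(("HSCHR", "LRG")):
            continue
        # UCSC calls the mitochondrial genome chrM, Ensembl calls it MT.
        if chrom == "MT":
            chrom = "M"
        # Ensembl coordinates are 1-based inclusive; BED is 0-based half-open.
        start = int(row[COL_START]) - 1
        end = int(row[COL_END])
        yield ("chr" + chrom, start, end, row[COL_BIOTYPE])


sample = io.StringIO(
    "Gene Biotype\tChromosome Name\tGene Start (bp)\tGene End (bp)\n"
    "lincRNA\t7\t99517494\t99522910\n"
    "lincRNA\tHSCHR6_MHC_COX\t1000\t2000\n"
    "lincRNA\tLRG_1\t1\t500\n"
)
for record in biomart_to_bed(sample):
    print("\t".join(str(field) for field in record))
```

Filtering on the chromosome field rather than on the whole line avoids
dropping a row just because some other column happens to contain "LRG".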
> cheers
> Max
> --
> Maximilian Haussler
> Tel: +447574246789
> http://www.manchester.ac.uk/research/maximilian.haussler/
>
>
> On Thu, Dec 2, 2010 at 10:17 AM, Ewan Birney <[email protected]> wrote:
>
>>
>>
>> The Ensembl project explicitly aims to predict long intergenic non-coding
>> RNAs (lincRNAs) using a similar scheme (ie, histone modification patterns)
>> and ESTs/cDNAs without coding potential in both Human and Mouse. They are
>> explicitly characterised as lincRNAs. Like all our "predictions", they are
>> biased towards a high specificity set and backed up by experimental
>> evidence.
>>
>> An example one is here:
>>
>>
>> http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000245883;r=7:99517494-99522910;t=ENST00000499990
>>
>>
>> Looking into the corresponding import of Ensembl into UCSC here:
>>
>>
>> http://genome.ucsc.edu/cgi-bin/hgc?hgsid=173968291&o=99517493&t=99522910&g=ensGene&i=ENST00000499990
>>
>> This transcript is there, but I can't spot the "biotype" slot here - it is
>> just that it is non coding (we have about 20 other non coding biotypes,
>> eg, snoRNAs, miRNAs etc)
>>
>>
>>
>> (Is this true - UCSC guys, would it be possible to get the concept of
>> BioType in the Ensembl set?)
>>
>>
>> Also the Havana project, which does manual curation, has a large set of
>> non coding RNAs. It is merged in a principled way with the Ensembl set
>> (ie, the Ensembl set is a super-set of Havana at the point of release)
>> and is also available in the UCSC browser.
>>
>>
>> The counts of lincRNAs in Human and Mouse in Ensembl are:
>>
>> 1443 - in Human
>>
>> 407 - in Mouse.
>>
>>
>> It is probably possible either to download the transcripts from UCSC and
>> the biotypes from Ensembl and join them with a script, or of course to
>> download the set directly from Ensembl. You might like to use our BioMart
>> tool:
>>
>> (showing our west coast mirror here)
>>
>> http://uswest.ensembl.org/biomart/martview/
>>
>>
>>
>>
>> On 2 Dec 2010, at 07:47, Bogdan Tanasa wrote:
>>
>>> Dear all,
>>>
>>> please could you recommend a track under "Genes and Gene Prediction
>>> Tracks" that has the highest number (with good accuracy) of
>>> known/predicted long ncRNAs (lincRNAs, etc)?
>>>
>>> thanks,
>>>
>>> Bogdan
>>> _______________________________________________
>>> Genome maillist - [email protected]
>>> https://lists.soe.ucsc.edu/mailman/listinfo/genome
>>
>>
>
-----------------------------------------------------------------
Ewan Birney. Work: +44 1223 494420
Email: birney "at" ebi.ac.uk
Clerical Assistant: shelley "at" ebi.ac.uk
Please cc shelley for urgent or diary-dependent requests
-----------------------------------------------------------------