Very interesting thread!

Bogdan, if you want to combine the data from the two URLs that Ewan sent you, be aware that UCSC is at version 59 of Ensembl, while the BioMart link points to version 60. If Ensembl changed anything between version 59 and version 60 for the human assembly (I don't know how to find this information on the web at the moment), you might want to use the version 59 BioMart at http://aug2010.archive.ensembl.org/biomart/martview/

You just select the checkboxes Attributes / Biotype, Chrom, Start, End and click on output to get the lincRNA coordinates. Note that the coordinates from Ensembl and UCSC are not completely compatible: you will need to remove all features on chromosomes HSCHR6_* or "LRG" (grep -v), prefix all chromosome names with "chr" (Excel, gawk, perl), and reorder the columns to get them into GFF or BED format.

cheers
Max

--
Maximilian Haussler
Tel: +447574246789
http://www.manchester.ac.uk/research/maximilian.haussler/

On Thu, Dec 2, 2010 at 10:17 AM, Ewan Birney <[email protected]> wrote:
> The Ensembl project explicitly aims to predict long intergenic non coding RNAs
> (lincRNAs) using a similar scheme (ie, histone modification patterns) and
> ESTs/cDNAs without coding potential in both Human and Mouse. They are explicitly
> characterised as lincRNAs. Like all our "predictions", they are biased towards
> a high specificity set and backed up by experimental evidence.
>
> An example one is here:
>
> http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000245883;r=7:99517494-99522910;t=ENST00000499990
>
> Looking into the corresponding import of Ensembl into UCSC here:
>
> http://genome.ucsc.edu/cgi-bin/hgc?hgsid=173968291&o=99517493&t=99522910&g=ensGene&i=ENST00000499990
>
> This transcript is there, but I can't spot the "biotype" slot here - it is just
> that it is non coding (we have about 20 other non coding biotypes, eg, snoRNAs,
> miRNAs etc)
>
> (Is this true - UCSC guys, would it be possible to get the concept of BioType in
> the Ensembl set?)
>
> Also the Havana project, which does manual curation, which is both merged in a
> principled way with the Ensembl set (ie, the Ensembl set is a super-set of Havana
> at the point of release) and is available in UCSC browser, also has a large set
> of non coding RNAs.
>
> Counts of lincRNAs in Human and Mouse in Ensembl are:
>
> 1443 - in Human
>
> 407 - in Mouse.
>
> It is probably possible to either download the set from UCSC and the biotypes
> from Ensembl and join them with a script, or of course download the set from
> Ensembl. You might like to use our BioMart tool:
>
> (showing our west coast mirror here)
>
> http://uswest.ensembl.org/biomart/martview/
>
> On 2 Dec 2010, at 07:47, Bogdan Tanasa wrote:
>
> > Dear all,
> >
> > please could you recommend a track "Genes and Gene Prediction Tracks" that
> > has the highest number (with good accuracy) of known/ predicted long ncRNAs
> > (lincRNAs, etc) ?
> >
> > thanks,
> >
> > Bogdan
> > _______________________________________________
> > Genome maillist - [email protected]
> > https://lists.soe.ucsc.edu/mailman/listinfo/genome
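
The cleanup described at the top of this thread (drop features on HSCHR6_* and LRG, prefix chromosome names with "chr", reorder columns into BED) could be sketched in Python roughly as follows. This is only an illustration under assumptions: the function name biomart_to_bed is made up, the BioMart export is assumed to be tab-separated with the columns in the order biotype, chromosome, start, end, and the biotype label is assumed to be spelled "lincRNA". The sketch also shifts the start by one, because Ensembl coordinates are 1-based while BED starts are 0-based; Ewan's example shows the same offset (r=7:99517494-99522910 at Ensembl versus o=99517493&t=99522910 at UCSC).

```python
import re

# Chromosome names to drop: the HSCHR6_* haplotype regions and LRG records.
SKIP_CHROM = re.compile(r"^(HSCHR6_|LRG)")


def biomart_to_bed(lines, biotype="lincRNA"):
    """Convert tab-separated BioMart rows into BED lines.

    Assumed (hypothetical) input column order: biotype, chromosome, start, end.
    """
    bed = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or truncated rows
        btype, chrom, start, end = fields[:4]
        if not (start.isdigit() and end.isdigit()):
            continue  # skip the header row
        if btype != biotype or SKIP_CHROM.match(chrom):
            continue  # keep only the wanted biotype on regular chromosomes
        # Ensembl is 1-based inclusive, BED is 0-based half-open:
        # subtract 1 from the start and leave the end unchanged.
        bed.append(f"chr{chrom}\t{int(start) - 1}\t{end}\t{btype}")
    return bed


if __name__ == "__main__":
    import sys

    # Read the BioMart export on stdin, write BED on stdout.
    for row in biomart_to_bed(sys.stdin):
        print(row)
```

Saved under a name of your choice, it would be run as a filter, reading the BioMart export on standard input and writing BED to standard output. The column order has to be adjusted if the export lists the attributes differently.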
