Hello Vinayak,

For tracks such as RefSeq, an alignment details table is available that records mismatches, insertions/deletions, etc. for each sequence versus the genomic sequence. For this track, the table name is refSeqAli. It can be viewed in the Assembly browser and extracted through Downloads or the Table Browser.
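As a rough illustration of how the counts in such an alignment table could be turned into a per-transcript figure, here is a minimal Python sketch. It assumes each row follows the 21-column PSL layout described in the UCSC format FAQ, optionally preceded by a bin column as in some database table dumps. The names summarize_psl_line and PslSummary are hypothetical, and the simple identity ratio is an illustrative choice, not necessarily the formula the browser uses for its own identity display.

```python
from typing import NamedTuple

PSL_FIELD_COUNT = 21  # number of columns in a standard PSL row


class PslSummary(NamedTuple):
    q_name: str       # transcript (query) name
    t_name: str       # chromosome (target) name
    identity: float   # matched bases / aligned bases (simple ratio, an assumption)
    coverage: float   # fraction of the transcript included in the alignment span
    q_inserts: int    # number of gaps on the transcript side
    t_inserts: int    # number of gaps on the genome side (mostly introns)


def summarize_psl_line(line: str) -> PslSummary:
    """Summarize one tab-separated PSL row (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    # Assumption: a 22-column row carries a leading "bin" index column.
    if len(fields) == PSL_FIELD_COUNT + 1:
        fields = fields[1:]
    if len(fields) != PSL_FIELD_COUNT:
        raise ValueError(f"expected {PSL_FIELD_COUNT} PSL fields, got {len(fields)}")

    matches = int(fields[0])
    mismatches = int(fields[1])
    rep_matches = int(fields[2])
    q_num_insert = int(fields[4])
    t_num_insert = int(fields[6])
    q_name = fields[9]
    q_size = int(fields[10])
    q_start = int(fields[11])
    q_end = int(fields[12])
    t_name = fields[13]

    aligned = matches + rep_matches + mismatches
    identity = (matches + rep_matches) / aligned if aligned else 0.0
    coverage = (q_end - q_start) / q_size if q_size else 0.0
    return PslSummary(q_name, t_name, identity, coverage, q_num_insert, t_num_insert)
```

Applied to every row of a downloaded table, a summary like this gives one identity and coverage value per transcript, which could then be aggregated per data source. Gaps on the genome side are expected for spliced transcripts, so they are reported separately rather than counted against identity.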
For EST data, the tables are named in the format chr1_intronEst. The data is too large to represent in a single table, so it is split by chromosome. The only option for obtaining the data for this track is ftp to Downloads.

When we do the alignment with BLAT (rather than receiving the coordinates from an external source), a table in this PSL format will be associated with the track (but not always as the primary table). This is especially true for all GenBank source data. Some investigation using the Assembly or Table Browser will be necessary on your part to locate the data for all track/data types.

FAQ for PSL format: http://genome.ucsc.edu/FAQ/FAQformat#format2

A few things to note:
1) For any dataset larger than 100k rows, use ftp to Downloads. The Table Browser does not support query results of that size.
2) Often a gene/gene prediction track, and occasionally an mRNA track, represents transcripts that are not a "single read" but rather a consensus sequence built from many reads (from various clone/library/tissue sources, rarely from a single clone). Occasionally the genomic sequence itself is used to resolve the trickier parts of genes (regions overlapping intra-gene duplicated segments, e.g. zinc finger proteins, transmembrane receptors, and certain disease/immunological genes with variable regions or overlap/inclusion of simple repeats that are difficult for automated base callers to resolve). So a comparison between the genomic sequence and the transcript could be falsely biased towards a "better match". This includes certain RefSeq sequences, in particular the "predicted" class and older versions, where the goal was to create a set of the "most common" variants, not a complete set of observed variants. Each track is different: read the methods section and contact the data source if anything is unclear from a scientific perspective or if alignment details are not documented (for tracks not aligned by BLAT).
3) Direct reads (ESTs) that are linked to clone/lib/tissue are the ultimate source of observed variation. However, the redundancy (multiple reads for a single clone), depth of coverage, and inherent (and often unavoidable) base-calling errors associated with the technology should be considered in any analysis that uses them. Once these factors are compensated for, the results can provide the purest and most statistically satisfying range of variation.

Good luck, and let us know if you need help not covered in our Help/FAQ.

Jennifer Jackson
UCSC Genome Bioinformatics Group

Vinayak Kulkarni wrote:
> Dear UCSC folks,
> I wanted to get some statistics on how identical the UCSC genome sequence is
> when compared to the actual transcript database sequences like the refseq.
> For example for the Gene EZH2,
> http://genome.ucsc.edu/cgi-bin/hgc?hgsid=132900058&o=148135407&t=148212347&g=refGene&i=NM_004456&c=chr7&l=148135407&r=148212347&db=hg18&pix=800
> The alignment column says 100%. Does that mean if I grab the sequence
> blocks from UCSC and assemble them I would replicate the Refseq sequence?
>
> Could you please let me know where I could find such data for all the transcripts, to get
> an estimate of how much of each source, eg Refseq, Ensembl are different
> when you compare their transcript sequences to the
> genomic sequence?
>
> Thank you very much,
> Vinayak.
>
>
_______________________________________________
Genome maillist - [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
