Re: [Genome] Difference between Refseq and UCSC genomic sequence?

Jennifer Jackson Thu, 21 May 2009 11:41:12 -0700

Hello Vinayak,

For tracks such as RefSeq, an alignment details table is available that 
notes mismatches, inserts/deletions, etc. for each sequence versus the 
genomic sequence. For this track, the table name is refSeqAli. It can be 
viewed in the Assembly browser and extracted via Downloads or the Table 
Browser.

For EST data, the table names follow the pattern chr1_intronEst. The 
data is too large to represent in a single table, so it is split by 
chromosome. The only option to obtain the data for this track is ftp to 
Downloads.
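
As a small illustration of the per-chromosome naming above, here is a 
minimal sketch that builds the expected table names. The function name 
and the chromosome list are assumptions for illustration only; the real 
set of tables depends on the assembly (and may include extra sequences 
such as "random" chromosomes), so check the Downloads directory listing.

```python
def per_chrom_tables(track_suffix, chroms):
    """Build names like 'chr1_intronEst' for a track split by chromosome."""
    return [f"{chrom}_{track_suffix}" for chrom in chroms]


# Assumed chromosome list (human autosomes plus X and Y); adjust per assembly.
human_chroms = [f"chr{n}" for n in list(range(1, 23)) + ["X", "Y"]]
table_names = per_chrom_tables("intronEst", human_chroms)
```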

When we do the alignment ourselves with BLAT (rather than receiving the 
coordinates from an external source), a table in PSL format will be 
associated with the track (but not always as the primary table). This is 
especially true for all GenBank source data. Some investigation using 
the Assembly or Table browser will be necessary on your part to locate 
the data for all track/data types. FAQ for PSL format: 
http://genome.ucsc.edu/FAQ/FAQformat#format2
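
To make the PSL columns concrete, here is a minimal sketch that reads 
one PSL row and reports mismatches and query inserts. The field order 
follows the PSL description in the FAQ; the handling of a leading "bin" 
column assumes a database table dump. The identity value is a 
simplified ratio chosen for this sketch, not the calculation the browser 
displays, and the example row is made up.

```python
from typing import NamedTuple


class PslSummary(NamedTuple):
    q_name: str
    matches: int       # matches + repMatches
    mismatches: int
    q_inserts: int     # number of gaps in the query (transcript)
    t_inserts: int     # number of gaps in the target (genome)
    identity: float    # simplified: matched / (matched + mismatched)


def summarize_psl_line(line: str) -> PslSummary:
    fields = line.rstrip("\n").split("\t")
    # Table dumps carry a leading "bin" column (22 fields); plain PSL has 21.
    if len(fields) == 22:
        fields = fields[1:]
    if len(fields) != 21:
        raise ValueError(f"expected 21 PSL fields, got {len(fields)}")
    matched = int(fields[0]) + int(fields[2])
    mismatches = int(fields[1])
    aligned = matched + mismatches
    identity = matched / aligned if aligned else 0.0
    return PslSummary(fields[9], matched, mismatches,
                      int(fields[4]), int(fields[6]), identity)


# Made-up row for illustration only (not real data).
example_row = "\t".join([
    "990", "10", "0", "0",      # matches, misMatches, repMatches, nCount
    "1", "3", "2", "500",       # qNumInsert, qBaseInsert, tNumInsert, tBaseInsert
    "+", "EXAMPLE_1", "1003", "0", "1003",   # strand, qName, qSize, qStart, qEnd
    "chr7", "158821424", "1000", "2503",     # tName, tSize, tStart, tEnd
    "3", "400,300,300,", "0,400,703,", "1000,1650,2203,",
])
summary = summarize_psl_line(example_row)
```

Note that for transcript-versus-genome alignments, gaps on the target 
side are mostly introns, so they are reported separately here rather 
than counted as differences.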

A few things to note:
1) for any dataset larger than 100k rows, use ftp to Downloads. The Table 
Browser does not support query results of that size.
2) often a gene/gene prediction track, and occasionally an mRNA track, 
represents transcripts that are not a "single read" but rather a 
consensus sequence built from many reads (from various 
clone/library/tissue sources - rarely from a single clone). 
Occasionally the genomic sequence itself is used to resolve the 
trickier parts of the genes (regions overlapping intra-gene duplicated 
segments, e.g. zinc finger proteins, transmembrane receptors, certain 
disease/immunological genes with variable regions, or overlap/inclusion 
of simple repeats that are difficult for automated base callers to 
resolve). So a comparison between the genomic and transcript sequences 
could be biased falsely towards a "better match". This includes certain 
RefSeq sequences, in particular the "predicted" class and older 
versions, where the goal was to create a set of the "most common" 
variants, not a complete set of observed variants. Each track is 
different - read the methods section and contact the data source if 
anything is unclear from a scientific perspective or if alignment 
details are not documented (for tracks not aligned by BLAT).
3) direct reads (ESTs) that are linked to clone/lib/tissue are the 
ultimate source of observed variation. However, the redundancy (multiple 
reads for a single clone), depth of coverage, and inherent (and often 
unavoidable) base-calling errors associated with the technology should 
be considered in any analysis that uses them. Once these factors are 
compensated for, the results can provide the purest and most 
statistically satisfying range of variation.
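
For the per-transcript statistics asked about in the original question, 
one possible approach is to stream a downloaded PSL table file and tally 
how many alignments show differences. This is only a sketch: the file 
name in the usage comment is an assumed example, the category names are 
invented for illustration, and it does not correct for the consensus 
bias or EST redundancy described in the notes above.

```python
import gzip
from collections import Counter


def tally_alignments(path: str) -> Counter:
    """Count PSL rows by the kind of difference they show versus genomic."""
    counts = Counter()
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) == 22:          # drop leading "bin" column
                fields = fields[1:]
            if len(fields) != 21 or not fields[0].isdigit():
                continue                   # skip headers or malformed rows
            mismatches = int(fields[1])
            q_num_insert = int(fields[4])
            q_size, q_start, q_end = (int(fields[i]) for i in (10, 11, 12))
            partial = q_start > 0 or q_end < q_size
            counts["total"] += 1
            if mismatches:
                counts["with_mismatches"] += 1
            if q_num_insert:
                counts["with_query_inserts"] += 1
            if partial:
                counts["partial"] += 1
            if not (mismatches or q_num_insert or partial):
                counts["exact"] += 1
    return counts


# Usage (assumed file name, fetched beforehand from Downloads):
# counts = tally_alignments("refSeqAli.txt.gz")
# print(counts["exact"], "of", counts["total"], "rows align without differences")
```

Target-side gaps are deliberately ignored in this tally because, for 
transcripts aligned to genomic sequence, they are expected to be mostly 
introns. Unaligned query ends (the "partial" category) may simply 
reflect poly-A tails, so treat that count with caution.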

Good luck and let us know if you need help not covered in our Help/FAQ,
Jennifer Jackson
UCSC Genome Bioinformatics Group

Vinayak Kulkarni wrote:
> Dear UCSC folks,
> I wanted to get some statistics on how identical the UCSC genome sequence is
> when compared to the actual transcript database sequences like the refseq.
> For example for the Gene EZH2,
> http://genome.ucsc.edu/cgi-bin/hgc?hgsid=132900058&o=148135407&t=148212347&g=refGene&i=NM_004456&c=chr7&l=148135407&r=148212347&db=hg18&pix=800
> The alignment column says 100%. Does that mean if I grab the sequence
> blocks from UCSC and assemble them I would replicate the RefSeq sequence?
>
> Could you please let me know where I could find such data for all the
> transcripts, to get an estimate of how much each source, e.g. RefSeq,
> Ensembl, differs when you compare their transcript sequences to the
> genomic sequence?
>
> Thank you very much,
> Vinayak.
>
>   
_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
