Hello Dave,

It looks as if you are finding RefSeq sequences with incomplete 
transcripts (CDS intact, but one of the UTRs missing).

For the example NM_000462, the 5' UTR is not included in the GenBank 
record, which we use as the data source to annotate the UTRs and CDS regions:
http://www.ncbi.nlm.nih.gov/nuccore/19718765?report=GenBank

The RefSeq data can include non-coding and partial transcripts. By 
examining the data in the Browser, I was able to identify another 
variant that does have an intact 5' UTR region: NM_130839.1. Manually 
examining each transcript variant set associated with a gene is not 
practical for large datasets, but it was a good sanity check to find out 
what was going on in that region of the genome. I also opened up the 
UCSC Genes track as a reference, since RefSeq transcripts are 
incorporated into that track as an input.
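
For larger datasets, a short script can list which accessions appear in 
only one of the two UTR files, so that you only need to inspect those in 
the Browser. The following is a minimal Python sketch, not an official 
UCSC tool. It assumes the BED name field always follows the pattern in 
your example lines (accession, then _utr5_ or _utr3_); the function names 
and the file names utr5.bed and utr3.bed are hypothetical.

```python
import re

# Assumed name format, taken from the example lines in the question:
#   NM_032291_utr5_0_0_chr1_66999825_f
# The accession is everything before "_utr5_" or "_utr3_".
NAME_RE = re.compile(r"^(?P<acc>[A-Z]{2}_\d+)_utr[35]_")


def accessions(bed_lines):
    """Return the set of RefSeq accessions found in column 4 of BED lines."""
    found = set()
    for line in bed_lines:
        # Skip blank lines, comments, and track/browser header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue
        match = NAME_RE.match(fields[3])
        if match:
            found.add(match.group("acc"))
    return found


def utr_discrepancy(utr5_lines, utr3_lines):
    """Return (only_in_3prime, only_in_5prime) as sorted lists."""
    utr5 = accessions(utr5_lines)
    utr3 = accessions(utr3_lines)
    return sorted(utr3 - utr5), sorted(utr5 - utr3)


# Example usage (hypothetical file names):
# with open("utr5.bed") as f5, open("utr3.bed") as f3:
#     only3, only5 = utr_discrepancy(f5, f3)
# print(len(only3), "accessions have a 3' UTR but no 5' UTR")
```

Accessions reported in only one list are the candidates for transcripts 
whose source record lacks one of the UTRs.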

I wanted to share these analysis ideas in case another question about 
track content comes up. It often helps to look at a sample of the 
unexpected data in context, against the reference genome with other 
relevant tracks open. Please feel free to ask the mailing list if you 
have further questions.

Best regards,
Jen

UCSC Genome Browser Support
http://genome.ucsc.edu/contacts.html
[email protected]  [email protected]

On 6/25/10 9:47 AM, Dave Tang wrote:
> Dear Genome List,
>
> While using the table browser to fetch different regions of RefSeq gene
> models (genome: hg19, group: mRNA and EST tracks, table: refseq genes and
> output format: BED), I find different numbers of refGene ids. I would
> expect to find the same number of refGene ids in each region's bed file.
>
> For example, using the table browser I got all the 5' UTR and 3' UTR
> refGene regions. Here are the first lines of each respective bed file:
>
> #5' UTR bed file
> chr1  66999824  67000041  NM_032291_utr5_0_0_chr1_66999825_f  0  +
>
> #3' UTR bed file
> chr1  67208778  67210767  NM_032291_utr3_24_0_chr1_67208779_f  0  +
>
> This is good because the refseq id (NM_032291) is in both. I parsed each
> file to get a non-redundant list of all the refGene ids in each bed file
> and found different numbers. For example, the refseq id NM_000462 is only
> in the 3' UTR bed file and not the 5' UTR.
>
> In total there are 30408 and 30627 refGene ids in the 5' and 3' UTR bed
> files, respectively.
>
> Could you explain the discrepancy?
>
> Thank you in advance.
>
> Cheers,
>
> Dave
> _______________________________________________
> Genome maillist  -  [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
