Hello Dave,

It looks as if you are finding RefSeq transcripts that are incomplete (CDS intact, but one of the UTRs missing).
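
One way to list the transcripts that fall into this category is to compare the two BED files by accession. Below is a minimal sketch in Python, assuming the Table Browser name format shown in your message (the accession followed by `_utr5_` or `_utr3_`). The function names are only illustrative and are not part of any UCSC tool.

```python
import re

# Table Browser BED names look like NM_032291_utr5_0_0_chr1_66999825_f;
# the accession is assumed to be everything before "_utr5_" or "_utr3_".
NAME_RE = re.compile(r"^(?P<acc>.+?)_utr[35]_")


def accession(name):
    """Return the accession part of a BED name, or the name itself if no match."""
    match = NAME_RE.match(name)
    return match.group("acc") if match else name


def ids_in_bed(lines):
    """Collect the non-redundant set of accessions from BED lines."""
    ids = set()
    for line in lines:
        line = line.strip()
        # Skip blank lines, comments, and track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if len(fields) >= 4:  # the name is the 4th BED column
            ids.add(accession(fields[3]))
    return ids


def compare(utr5_lines, utr3_lines):
    """Return (accessions only in the 3' UTR set, accessions only in the 5' UTR set)."""
    utr5 = ids_in_bed(utr5_lines)
    utr3 = ids_in_bed(utr3_lines)
    return sorted(utr3 - utr5), sorted(utr5 - utr3)
```

To use it, open the two downloaded files (for example with hypothetical names such as `utr5.bed` and `utr3.bed`) and pass the file objects to `compare`. The two lists it returns are the accessions worth checking in the Browser.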

For the example NM_000462, the 5' UTR is not included in the GenBank record, which we use as the data source to annotate the UTR and CDS regions:

http://www.ncbi.nlm.nih.gov/nuccore/19718765?report=GenBank

The RefSeq data can include non-coding and partial transcripts. By examining the data in the Browser, I was able to identify another variant that does have an intact 5' UTR region: NM_130839.1.

Manually examining each set of transcript variants associated with a gene is not practical for large datasets, but it was a good sanity check to find out what was going on in that region of the genome. I also opened the UCSC Genes track as a reference, since RefSeq transcripts are one of the inputs to that track.

I wanted to share these analysis ideas in case another question like this comes up about content. It sometimes helps to just take a look at a sample of the unexpected data in context, against the reference genome with other relevant tracks open. Please feel free to ask the mailing list again if you have further questions.

Best regards,
Jen
UCSC Genome Browser Support
http://genome.ucsc.edu/contacts.html
[email protected] [email protected]

On 6/25/10 9:47 AM, Dave Tang wrote:
> Dear Genome List,
>
> While using the table browser to fetch different regions of refseqs gene
> models (genome: hg19, group: mRNA and EST tracks, table: refseq genes and
> output format: BED), I find different numbers of refGene ids. I would
> expect to find the exact number of refGene ids in each region bed file.
>
> For example, using the table browser I got all the 5' UTR and 3' UTR
> refGene regions. Here are the first lines of each respective bed file:
>
> #5' UTR bed file
> chr1 66999824 67000041 NM_032291_utr5_0_0_chr1_66999825_f 0 +
>
> #3' UTR bed file
> chr1 67208778 67210767 NM_032291_utr3_24_0_chr1_67208779_f 0 +
>
> This is good because the refseq id (NM_032291) is in both. I parsed each
> file to get a non-redundant list of all the refGene ids in each bed file
> and found different numbers. For example, the refseq id NM_000462 is only
> in the 3' UTR bed file and not the 5' UTR.
>
> In total there are 30408 and 30627 refGene ids in the 5' and 3' UTR bed
> files, respectively.
>
> May you explain the discrepancy?
>
> Thank you in advance.
>
> Cheers,
>
> Dave
> _______________________________________________
> Genome maillist - [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
