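Hello Jun,

The duplicate entries you are seeing come from RefSeq genes whose RNA aligned to more than one location in the genome. Each alignment that was kept gets its own row, so the same NM accession appears with different coordinates. There are many such entries.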
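If you want to list these entries yourself, here is a minimal Python sketch. It assumes the usual tab-separated refFlat column order (geneName, name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds). The function names are made up for illustration and are not part of any UCSC tool, and the sample rows are the two TSPY3 lines from your message.

```python
import gzip
from collections import defaultdict

# The two TSPY3 rows quoted in the original question, tab-separated.
SAMPLE_LINES = [
    "TSPY3\tNM_001077697\tchrY\t+\t9236029\t9238826\t9236075\t9238638\t6\t"
    "9236029,9237168,9237374,9237587,9237839,9238615,\t"
    "9236561,9237246,9237486,9237733,9237921,9238826,\n",
    "TSPY3\tNM_001077697\tchrY\t+\t9365488\t9368285\t9365534\t9368097\t6\t"
    "9365488,9366627,9366833,9367046,9367298,9368074,\t"
    "9366020,9366705,9366945,9367192,9367380,9368285,\n",
]


def multi_location_accessions(lines):
    """Map each accession to its placements, keeping only those with more than one.

    Assumes refFlat column order: geneName, name (accession), chrom, strand,
    txStart, txEnd, ...
    """
    placements = defaultdict(set)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\n").split("\t")
        accession, chrom, strand = fields[1], fields[2], fields[3]
        tx_start, tx_end = int(fields[4]), int(fields[5])
        placements[accession].add((chrom, strand, tx_start, tx_end))
    return {acc: sorted(locs) for acc, locs in placements.items() if len(locs) > 1}


def multi_location_accessions_in_file(path):
    """Same as above, reading a gzip-compressed refFlat file such as refFlat.txt.gz."""
    with gzip.open(path, "rt") as handle:
        return multi_location_accessions(handle)
```

Calling `multi_location_accessions(SAMPLE_LINES)` reports NM_001077697 with two chrY placements; calling `multi_location_accessions_in_file("refFlat.txt.gz")` on your download should list every accession that was kept at more than one location.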
Note this part of the RefSeq track description:

RefSeq RNAs were aligned against the human genome using blat; those with an alignment of less than 15% were discarded. When a single RNA aligned in multiple places, the alignment having the highest base identity was identified. Only alignments having a base identity level within 0.1% of the best and at least 96% base identity with the genomic sequence were kept.

Best regards,

Pauline Fujita
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu

On 10/27/10 06:36, Lu, Jun (NIH/NIEHS) [C] wrote:
> Hi,
>
> I couldn't get an answer so would like to see whether someone here knows the
> reason.
>
> I downloaded a refseq hg19 flat file "refFlat.txt.gz" from here:
> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/
>
> What I have noticed is that I saw significant number of duplicates in this
> file, i.e. the same NM#, but with different coordinate annotation. For
> example:
> TSPY3 NM_001077697 chrY + 9236029 9238826 9236075 9238638 6
> 9236029,9237168,9237374,9237587,9237839,9238615,
> 9236561,9237246,9237486,9237733,9237921,9238826,
> TSPY3 NM_001077697 chrY + 9365488 9368285 9365534 9368097 6
> 9365488,9366627,9366833,9367046,9367298,9368074,
> 9366020,9366705,9366945,9367192,9367380,9368285,
>
> It seems that all positions are shifted.
>
> Thanks.
> Jun
>
> _______________________________________________
> Genome maillist - [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
