Hello Jun,

The duplicate entries you are seeing come from RefSeq genes that aligned 
to more than one location in the genome. There are many such entries.
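
To list the affected accessions yourself, something like the following could work. This is only a minimal sketch: the function names are made up for illustration, and it assumes the standard tab-separated refFlat column order (geneName, name, chrom, strand, txStart, txEnd, ...) and a local copy of refFlat.txt.gz.

```python
import gzip
from collections import defaultdict


def multi_locus_accessions(lines):
    """Group refFlat rows by accession; keep accessions with more than one locus."""
    loci = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed rows
        # Assumed refFlat columns: geneName, name, chrom, strand, txStart, txEnd, ...
        name, chrom, strand = fields[1], fields[2], fields[3]
        tx_start, tx_end = int(fields[4]), int(fields[5])
        loci[name].add((chrom, strand, tx_start, tx_end))
    return {acc: sorted(places) for acc, places in loci.items() if len(places) > 1}


def read_refflat(path="refFlat.txt.gz"):
    """Hypothetical helper: run the check on a gzipped refFlat download."""
    with gzip.open(path, "rt") as handle:
        return multi_locus_accessions(handle)


# The two TSPY3 rows from the question, as tab-separated refFlat lines.
sample = [
    "TSPY3\tNM_001077697\tchrY\t+\t9236029\t9238826\t9236075\t9238638\t6\t"
    "9236029,9237168,9237374,9237587,9237839,9238615,\t"
    "9236561,9237246,9237486,9237733,9237921,9238826,\n",
    "TSPY3\tNM_001077697\tchrY\t+\t9365488\t9368285\t9365534\t9368097\t6\t"
    "9365488,9366627,9366833,9367046,9367298,9368074,\t"
    "9366020,9366705,9366945,9367192,9367380,9368285,\n",
]
duplicates = multi_locus_accessions(sample)
for accession, places in duplicates.items():
    print(accession, places)
```

Calling read_refflat() on the downloaded file should return every accession that appears at more than one location.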

Note this part of the RefSeq track description:

RefSeq RNAs were aligned against the human genome using blat; those with 
an alignment of less than 15% were discarded. When a single RNA aligned 
in multiple places, the alignment having the highest base identity was 
identified. Only alignments having a base identity level within 0.1% of 
the best and at least 96% base identity with the genomic sequence were kept.
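
As a rough illustration of that last rule (not the actual pipeline code), the filter might be sketched like this. The function name and the input list are hypothetical, and it assumes "within 0.1%" means 0.1 percentage points of base identity.

```python
def keep_alignments(identities, window=0.1, floor=96.0):
    """Filter the percent base identities of one RNA's alignments.

    Keeps an alignment only if it is within `window` percentage points
    of the best alignment and has at least `floor` percent identity.
    """
    if not identities:
        return []
    best = max(identities)
    return [pid for pid in identities if pid >= floor and best - pid <= window]
```

Under this reading, an RNA whose alignments to two genomic copies are equally good keeps both, which is how one accession can end up with two sets of coordinates.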


Best regards,

Pauline Fujita
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu

On 10/27/10 06:36, Lu, Jun (NIH/NIEHS) [C] wrote:
> Hi,
> 
> I couldn't get an answer so would like to see whether someone here knows the 
> reason.
> 
> I downloaded a refseq hg19 flat file "refFlat.txt.gz" from here: 
> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/
> 
> What I have noticed is a significant number of duplicates in this 
> file, i.e. the same NM#, but with different coordinate annotations. For 
> example:
> TSPY3   NM_001077697    chrY    +       9236029 9238826 9236075 9238638 6       9236029,9237168,9237374,9237587,9237839,9238615,        9236561,9237246,9237486,9237733,9237921,9238826,
> TSPY3   NM_001077697    chrY    +       9365488 9368285 9365534 9368097 6       9365488,9366627,9366833,9367046,9367298,9368074,        9366020,9366705,9366945,9367192,9367380,9368285,
> 
> It seems that all positions are shifted.
> 
> Thanks.
> Jun
> 
> 
> _______________________________________________
> Genome maillist  -  [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
