Hi,

I downloaded the hg19 refFlat.txt.gz file from UCSC (ftp://genome-ftp.cse.ucsc.edu/goldenPath/hg18/database/refFlat.txt.gz). It has multiple lines for a single gene. For instance, the command

    % grep -e DEFB106A -e DEFB106B refFlat.txt

generates these four lines:

    DEFB106A NM_152251 chr8 - 7340025 7343909 7340125 7343904 2 7340025,7343855, 7340274,7343909,
    DEFB106A NM_152251 chr8 + 7682693 7686575 7682698 7686475 2 7682693,7686326, 7682747,7686575,
    DEFB106B NM_001040704 chr8 - 7340025 7343909 7340125 7343904 2 7340025,7343855, 7340274,7343909,
    DEFB106B NM_001040704 chr8 + 7682693 7686575 7682698 7686475 2 7682693,7686326, 7682747,7686575,

In other words, DEFB106A and DEFB106B have exactly the same annotation: each gene is listed at both loci. I realize this is because of the duplicated regions, but NCBI gives the coordinates as:

* DEFB106A: Chromosome 8 (7682694..7686575)
* DEFB106B: Chromosome 8 (7340026..7343909, complement)

Is there any way of getting a refFlat file with only the NCBI version of the gene coordinates (or as close to those coordinates as possible)?

Thanks,
Vamsi
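
P.S. To make the duplication concrete, here is a minimal Python sketch that groups refFlat lines by locus and lists the gene names found at each one. The function names (parse_refflat_line, group_by_locus) are only illustrative, not part of any UCSC tool, and I am assuming the fields are whitespace-separated in the order they appear in the grep output above (gene name, accession, chromosome, strand, txStart, txEnd, ...).

```python
from collections import defaultdict

# The four lines returned by the grep command, copied verbatim.
SAMPLE = """\
DEFB106A NM_152251 chr8 - 7340025 7343909 7340125 7343904 2 7340025,7343855, 7340274,7343909,
DEFB106A NM_152251 chr8 + 7682693 7686575 7682698 7686475 2 7682693,7686326, 7682747,7686575,
DEFB106B NM_001040704 chr8 - 7340025 7343909 7340125 7343904 2 7340025,7343855, 7340274,7343909,
DEFB106B NM_001040704 chr8 + 7682693 7686575 7682698 7686475 2 7682693,7686326, 7682747,7686575,
"""


def parse_refflat_line(line):
    """Return the leading fields of one refFlat line as a dict.

    Assumes whitespace-separated fields in the order seen in the
    grep output: gene, accession, chrom, strand, txStart, txEnd, ...
    """
    fields = line.split()
    return {
        "gene": fields[0],
        "accession": fields[1],
        "chrom": fields[2],
        "strand": fields[3],
        "tx_start": int(fields[4]),
        "tx_end": int(fields[5]),
    }


def group_by_locus(lines):
    """Map (chrom, strand, tx_start, tx_end) -> set of gene names."""
    loci = defaultdict(set)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        rec = parse_refflat_line(line)
        key = (rec["chrom"], rec["strand"], rec["tx_start"], rec["tx_end"])
        loci[key].add(rec["gene"])
    return dict(loci)


if __name__ == "__main__":
    # Print each locus together with every gene name annotated there.
    for locus, genes in sorted(group_by_locus(SAMPLE.splitlines()).items()):
        print(locus, sorted(genes))
```

On the four lines above, both loci come back with both gene names attached, which is the ambiguity I would like to avoid.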
