Hello Shuying,

This occurs because we generate our CpG Island data from a repeat-masked genome. Any CG sites within repeat regions are masked out, so they are not counted.
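As a rough check of this explanation, here is a minimal Python sketch. It is an illustration only, not the program used to build the CpG Island track. It counts CG dinucleotides in the island sequence quoted in the original message (the repeatMasking=lower output, where lowercase bases are repeat-masked), once ignoring case and once counting only uppercase, unmasked bases. The names `ISLAND`, `count_cg`, and `include_masked` are made up for this example.

```python
# Sequence for chrX:38071367-38071954 (hg18), copied from the
# repeatMasking=lower output in the original message.
# Lowercase bases are repeat-masked.
ISLAND = "".join([
    "CGTCCGGTCCTCTGCCCTCAGTCATTCGCGGGAGCGCAACCAGCGATCCC",
    "GCCCCAGTCCGGCTGCCAAGCCTGGGGCCTGTCCCCCTACAGGGCCGATC",
    "CGGAggcggggcccggccgcccgcggACCCTCCCTCCCGGCCTTCCGCCA",
    "CCGGCGCGGGCGCAACTCACCGGGCATCAGCTCTTCCGGCTCCCTCATGC",
    "CACGGGCAGTACGGGCAGCCTGCGCCGGGGCCAGGAGGCTGTAGAGGACG",
    "GTTTGGTCGGGGCTAAAGCAGCTACTCCGCACCGACGCGGGCCGCGAAAG",
    "CCCCCAAGTTCCGCATGGCGAAACTCCGGAGATCAACTACAACCGCGCTC",
    "CCGGAAGTCAACAAACAGCCGCTACGGGCAACGGGGGCGGAGCTTGGGAA",
    "TGCAAGGCGGGACAGGCGCCGTTGGGGAGGGGAACGGAGGCCGGGTGGCT",
    "GGTAAGGGGCAGGCTCAGGCACAGCGGAGGGGCAGTAGAGACCACGCGCC",
    "CTCTGGCGGCCTGGAGCAGAGAGGCGGCCACGCCGCGCAGTGATGCTGTG",
    "GAGTCCGCGCCCTTGTGCCGTTGGAGGTCCAGGCGCCG",
])


def count_cg(seq: str, include_masked: bool = True) -> int:
    """Count CG dinucleotides in seq.

    include_masked=True  ignores case, so masked (lowercase) sites count.
    include_masked=False counts only CG pairs where both bases are uppercase.
    """
    if include_masked:
        return seq.upper().count("CG")
    return seq.count("CG")


total = count_cg(ISLAND, include_masked=True)      # 63
unmasked = count_cg(ISLAND, include_masked=False)  # 58
print(f"all CG sites: {total}")
print(f"unmasked CG sites: {unmasked}")
print(f"CG sites inside masked sequence: {total - unmasked}")
```

The five-site difference falls entirely within the lowercase stretch on the third sequence line. No CG in this sequence straddles a masked/unmasked boundary, so the sketch says nothing about how such a case would be treated.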
It may be helpful to turn on the RepeatMasker track in the 'Variation and Repeats' section to see which parts of the genome have been masked.

I hope this clears things up for you. Please contact us again if you have further inquiries.

Best,
Antonio Coelho
UCSC Genome Bioinformatics Group

----- Original Message -----
From: "S. Sun" <[email protected]>
To: [email protected]
Cc: [email protected]
Sent: Monday, June 14, 2010 8:34:30 PM GMT -08:00 US/Canada Pacific
Subject: [Genome] Questions about CG count in one CpG island: chrX:38071367-38071954.

Hello,

I have a simple question about the number of CG sites in one CpG island: chrX:38071367-38071954. The UCSC Genome Browser shows that it has 58 CG sites, but when I count the CG sites myself, I get 63. I got 63 both from my R code and from manual counting (i.e., searching for "CG" in a text file on Linux and highlighting the matches). In fact, I got 63 in both the DNA sequence I downloaded from the UCSC Genome Browser and the hg18 sequence I got from the R Bioconductor package. See the following for more details.

Do you have any idea why we get this inconsistent result? Is it because those 5 CG sites are located in the repeat region, so they are not included? If yes, why are these 5 CG sites handled this way?
############ UCSC hg18 version DNA sequences ############################
>hg18_cpgIslandExt_CpG: 58 range=chrX:38071367-38071954 5'pad=0 3'pad=0 strand=+ repeatMasking=lower
CGTCCGGTCCTCTGCCCTCAGTCATTCGCGGGAGCGCAACCAGCGATCCC # 7 (including the one with C at the end and G at the beginning)
GCCCCAGTCCGGCTGCCAAGCCTGGGGCCTGTCCCCCTACAGGGCCGATC # 2
CGGAggcggggcccggccgcccgcggACCCTCCCTCCCGGCCTTCCGCCA # 8
CCGGCGCGGGCGCAACTCACCGGGCATCAGCTCTTCCGGCTCCCTCATGC # 6
CACGGGCAGTACGGGCAGCCTGCGCCGGGGCCAGGAGGCTGTAGAGGACG # 5
GTTTGGTCGGGGCTAAAGCAGCTACTCCGCACCGACGCGGGCCGCGAAAG # 7
CCCCCAAGTTCCGCATGGCGAAACTCCGGAGATCAACTACAACCGCGCTC # 5
CCGGAAGTCAACAAACAGCCGCTACGGGCAACGGGGGCGGAGCTTGGGAA # 5
TGCAAGGCGGGACAGGCGCCGTTGGGGAGGGGAACGGAGGCCGGGTGGCT # 5
GGTAAGGGGCAGGCTCAGGCACAGCGGAGGGGCAGTAGAGACCACGCGCC # 3
CTCTGGCGGCCTGGAGCAGAGAGGCGGCCACGCCGCGCAGTGATGCTGTG # 5
GAGTCCGCGCCCTTGTGCCGTTGGAGGTCCAGGCGCCG # 5

###################### From R Bioconductor genome sequence
CGTCCGGTCCTCTGCCCTCAGTCATTCGCGGGAGCGCAACCAGCGATCCC # 7
GCCCCAGTCCGGCTGCCAAGCCTGGGGCCTGTCCCCCTACAGGGCCGATC # 2
CGGAGGCGGGGCCCGGCCGCCCGCGGACCCTCCCTCCCGGCCTTCCGCCA # 8
CCGGCGCGGGCGCAACTCACCGGGCATCAGCTCTTCCGGCTCCCTCATGC # 6
CACGGGCAGTACGGGCAGCCTGCGCCGGGGCCAGGAGGCTGTAGAGGACG # 5
GTTTGGTCGGGGCTAAAGCAGCTACTCCGCACCGACGCGGGCCGCGAAAG # 7
CCCCCAAGTTCCGCATGGCGAAACTCCGGAGATCAACTACAACCGCGCTC # 5
CCGGAAGTCAACAAACAGCCGCTACGGGCAACGGGGGCGGAGCTTGGGAA # 5
TGCAAGGCGGGACAGGCGCCGTTGGGGAGGGGAACGGAGGCCGGGTGGCT # 5
GGTAAGGGGCAGGCTCAGGCACAGCGGAGGGGCAGTAGAGACCACGCGCC # 3
CTCTGGCGGCCTGGAGCAGAGAGGCGGCCACGCCGCGCAGTGATGCTGTG # 5
GAGTCCGCGCCCTTGTGCCGTTGGAGGTCCAGGCGCCG # 5

Shuying
_______________________________________________
Genome maillist - [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
