In some ways it's not until you have a multiple alignment that you have the strongest indication that a region is contamination rather than just highly conserved sequence. This is less of an issue for the non-mammals, though, where even "ultraconserved" is going to be no more than 90% identical.
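A rough sketch of how that 90% rule of thumb could be turned into a check on one pairwise row of a multiple alignment. The function names and the cutoffs (`min_identity`, `min_columns`) are hypothetical choices for illustration, loosely motivated by the >97% over 17Kb match reported in the quoted message; this is not part of phastCons or any UCSC tool.

```python
def percent_identity(seq_a, seq_b):
    """Return (percent identity, columns compared) for two aligned sequences.

    Columns where either sequence has a gap ('-') or an 'N' are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "-N" or b in "-N":
            continue
        compared += 1
        if a == b:
            matches += 1
    identity = 100.0 * matches / compared if compared else 0.0
    return identity, compared


def looks_like_contamination(human, other, min_identity=95.0, min_columns=1000):
    """Flag a human vs. non-mammal aligned pair that is suspiciously similar.

    Assumed cutoffs: more than 95% identity over at least 1000 comparable
    columns. Both values are illustrative and would need tuning.
    """
    identity, compared = percent_identity(human, other)
    return compared >= min_columns and identity > min_identity
```

A hit from a check like this would only be a candidate for manual review; as noted above, conservation in a second related species is still the stronger evidence.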
On Feb 6, 2010, at 6:58 AM, Adam Siepel wrote:

> Hi folks -- yes, this is an unfortunate problem, but I've always
> resisted tackling it at the level of the phastCons tracks. It
> really would be best to filter these elements out of the
> assemblies. Barring this, I would suggest addressing it at the
> level of the alignments, because it's not just phastCons that is
> affected -- any method that makes use of patterns of conservation in
> the multiple alignments is likely to be confused by these regions.
>
> Adam
>
> On Feb 5, 2010, at 12:49 PM, Jim Kent wrote:
>
>> I remember facing this issue of conservation via human
>> contamination when we were first doing comparative genomics,
>> when the mouse was sequenced. It's one reason we didn't call
>> the ultraconserved regions at that point. It wasn't until the rat
>> sequence was available and they were conserved there that we
>> were convinced they weren't artifacts. In the process we did flag
>> them in the mouse and get the assemblers to remove ones where
>> there was not excellent evidence joining them to non-conserved
>> regions in the mouse assembly.
>>
>> So, I am not surprised this is a problem. The best solution is to
>> get the xenopus and zebrafish assemblies cleaned up. I'll cc this
>> message to [email protected], the help link for Zebrafish, and to
>> Dan Rokhsar, who I know did some work at least in the past on
>> Xenopus. I'll also cc Adam Siepel, the author of phastCons, and
>> our own David Haussler to collect their thoughts on the best way
>> to proceed.
>>
>> Take care
>> Jim
>>
>> On Feb 5, 2010, at 1:41 AM, Philippe Gautier wrote:
>>
>>> Hello,
>>> I'm new on this mailing list, so apologies if it's the wrong place
>>> to ask the following question. Feel free to redirect me if needed!
>>> I'm working in a Bioinformatics service in our Unit and someone
>>> asked me if they could get a list of the most conserved elements
>>> in vertebrates. I thought "easy, I just have to download the
>>> phastConsElements46way table and take the highest-scoring ones."
>>> I decided to check "manually" a few of them and was horrified to
>>> see that all (or most) seem to be artifacts due to human genomic
>>> DNA contamination in other species.
>>> One example, the longest element:
>>> chr5:69686054-6970347 in GRCh37, lod=14726, score=995.
>>> It looks like it is conserved only in Xenopus and not in other
>>> vertebrates (looking at the Multiz alignment tracks). And when I
>>> realigned it to the corresponding Xenopus genomic sequence
>>> (scaffold_7921:87-17248) it is virtually identical (>97% over
>>> 17Kb), undoubtedly a contamination!
>>> Moreover, I looked at several other elements down the list and
>>> almost all the top ones (the longest) are similar: not conserved
>>> in any vertebrate, except in Xenopus or Zebrafish. These pieces of
>>> DNA do contain LINE or LTR repeats, so they are present in the
>>> human genome in multiple copies, but that does not explain such
>>> high conservation in frog or fish, and can only be explained by
>>> genome contamination.
>>> Obviously, it is a problem at the assembly level, but I was also
>>> wondering if these elements should not be filtered out of the
>>> phastCons element list?
>>>
>>> Philippe
>>>
>>> --
>>> Philippe Gautier
>>> Bioinformatics Service
>>> MRC - Human Genetics Unit
>>> Western General Hospital
>>> Crewe Road
>>> Edinburgh EH4 2XU
>>> U.K.
>>> tel: 0131 332 24 71
>>>
>>> _______________________________________________
>>> Genome maillist - [email protected]
>>> https://lists.soe.ucsc.edu/mailman/listinfo/genome
