Hello, I am trying to do a multiple alignment of the genomes of several organisms. To make sure I am doing this correctly, I tried to recreate the rheMac2 to rn4 pairwise alignment that is available from the UCSC FTP server. I found some discrepancies between my alignment and the UCSC alignment in the repeat regions.
From looking at src/hg/utils/automation/blastz-run-ucsc, I understand that repeats are removed from the Fasta genome files by strip_rpts before running lastz. In src/hg/makeDb/doc/rheMac2.txt, the repeats to be removed are determined by running

    DateRepeats chr*.fa.out -query human -comp mouse -comp dog

and then running extractRepeats 1 on the output. I couldn't find the extractRepeats program, but I am guessing that I can get the appropriate result by

    DateRepeats chr*.fa.out -query human -comp mouse

Then I run selectRpts in blastz-run-ucsc on chr7.fa.out_mus-musculus to generate the chr*.rpts file, which I then use with strip_rpts to generate the stripped chromosome. I then run lastz on the stripped chromosomes.

However, when I look at the UCSC genome-wide alignment between rheMac2 and rn4, it seems that the repeats that should have been removed are included in the alignment. As an example, one of the repeats removed by strip_rpts is the LINE/L2 repeat at chr7:87564770..87565324 in rheMac2. But the first aligned chain in rheMac2.rn4.all.chain.gz from UCSC starts with

    chain 547645084 chr7 169801366 + 87564296 169142947 chr6 147636619 + 64051113 138454126 1
    17 2 0
    57 0 1
    19 0 1
    7 7 0
    139 6 0
    15 1 0
    12 7 0
    4 3 0
    11 0 1
    57 1 0
    52 0 1
    23 6 0
    61 0 1 <=== this block overlaps the LINE/L2 repeat
    51 2 0
    71 1 0
    ...

Now I understand that the lastz alignments shouldn't seed in a repeat, but are allowed to extend into a repeat. But since we removed the repeat sequences from the Fasta file altogether, how can this alignment extend into a repeat?

Best wishes, and many thanks in advance,

Michiel de Hoon
RIKEN Omics Science Center
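For reference, the arithmetic behind the "<===" marker in the chain excerpt can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each chain row is read as "size dt dq" with 0-based, half-open coordinates (as in the UCSC chain format), both strands are "+", the repeat coordinates are taken to be 1-based and inclusive (as in RepeatMasker .out files), and only the rows quoted in the message are used. The names chain_blocks, ROWS, and overlapping are made up for this example and are not part of any UCSC tool.

```python
# Sketch: walk the ungapped blocks of the quoted chain and report which ones
# overlap the LINE/L2 repeat. All names here are hypothetical helpers.

T_START = 87564296   # tStart from the chain header (chr7, rheMac2)
Q_START = 64051113   # qStart from the chain header (chr6, rn4)

# Rows quoted in the message, as (size, dt, dq). The elided rows ("...") are
# deliberately not filled in.
ROWS = [
    (17, 2, 0), (57, 0, 1), (19, 0, 1), (7, 7, 0), (139, 6, 0),
    (15, 1, 0), (12, 7, 0), (4, 3, 0), (11, 0, 1), (57, 1, 0),
    (52, 0, 1), (23, 6, 0), (61, 0, 1), (51, 2, 0), (71, 1, 0),
]

# Assumption: chr7:87564770..87565324 is 1-based inclusive, so convert it to
# 0-based half-open to match the chain coordinates.
REPEAT_START = 87564770 - 1
REPEAT_END = 87565324


def chain_blocks(t_start, q_start, rows):
    """Yield (t_begin, t_end, q_begin, q_end) for each ungapped block."""
    t, q = t_start, q_start
    for size, dt, dq in rows:
        yield (t, t + size, q, q + size)
        # After the block, skip the gap on the target (dt) and query (dq).
        t += size + dt
        q += size + dq


# Keep the blocks whose target interval intersects the repeat interval.
overlapping = [
    block for block in chain_blocks(T_START, Q_START, ROWS)
    if block[0] < REPEAT_END and block[1] > REPEAT_START
]

for t_begin, t_end, q_begin, q_end in overlapping:
    print(f"chr7:{t_begin}-{t_end} overlaps the repeat "
          f"(aligned to chr6:{q_begin}-{q_end})")
```

Under these assumptions the first block reported is the 61-base block covering chr7:87564742-87564803, which is the row marked in the excerpt; the two quoted rows after it fall inside the repeat as well.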
