Dear Pauline,

Many thanks for your reply. I think you are right and that I am inadvertently 
removing repeats that should not be removed.

To select the appropriate repeats from the DateRepeats output, UCSC uses the 
extractRepeats and extractLinSpecReps scripts. Can these scripts be made 
available somewhere? I couldn't find them in Jim Kent's software collection or 
elsewhere.

Thank you again,
--Michiel

> Hello Michiel,
>
> We do not remove all repeats; we remove only the lineage-specific
> repeats. If your RepeatMasker scripts are failing, you may not
> have been able to produce the actual lineage-specific repeats.
>
> For more info on the RepeatMasker scripts used to construct these
> files please see the associated makedoc.
>
>
> Hopefully this information was helpful and answers your question. If
> you have further questions or require clarification feel free to
> contact the mailing list at genome at soe.ucsc.edu.
> <https://lists.soe.ucsc.edu/mailman/listinfo/genome>
>
> Regards,
>
> Pauline Fujita UCSC Genome Bioinformatics Group
> http://genome.ucsc.edu
>
>
> On 06/21/11 21:00, Michiel de Hoon wrote:
>> Hello,
>> I am trying to do a multiple alignment of the genomes of
>> several organisms. To make sure I am doing this correctly, I tried to
>> recreate the rheMac2 to rn4 pairwise alignment that is available from
>> the UCSC FTP server. I found some discrepancies between my alignment
>> and the UCSC alignment in the repeat regions.
>> From looking at src/hg/utils/automation/blastz-run-ucsc, I understand
>> that repeats are removed from the Fasta genome files by strip_rpts
>> before running lastz. In src/hg/makeDb/doc/rheMac2.txt, the repeats
>> to be removed are determined by running 
>>         DateRepeats chr*.fa.out -query human -comp mouse -comp dog
>> and then running extractRepeats 1 on the output. I couldn't find the
>> extractRepeats program, but I am guessing that I can get the
>> appropriate result by
>>         DateRepeats chr*.fa.out -query human -comp mouse
>> Then I run selectRpts in blastz-run-ucsc on chr7.fa.out_mus-musculus
>> to generate the chr*.rpts file, which I then use with strip_rpts to
>> generate the stripped chromosome. I then run lastz on the stripped 
>> chromosomes.
>> However, when I look at the UCSC genome-wide alignment between
>> rheMac2 and rn4, it seems that the repeats that should have been
>> removed are included in the alignment.
>> As an example, one of the repeats removed by strip_rpts is the
>> LINE/L2 repeat at chr7:87564770..87565324 in rheMac2. But the first 
>> aligned chain in rheMac2.rn4.all.chain.gz from UCSC starts with
>>
>> chain 547645084 chr7 169801366 + 87564296 169142947 chr6 147636619 + 64051113 13 8454126 1
>> 17      2       0
>> 57      0       1
>> 19      0       1
>> 7       7       0
>> 139     6       0
>> 15      1       0
>> 12      7       0
>> 4       3       0
>> 11      0       1
>> 57      1       0
>> 52      0       1
>> 23      6       0
>> 61      0       1    <=== this block overlaps the LINE/L2 repeat
>> 51      2       0
>> 71      1       0
>>
>>
>> Now I understand that the lastz alignments shouldn't seed in
>> a repeat, but are allowed to extend into a repeat. But since we
>> removed the repeat sequences from the Fasta file altogether,
>> how can this alignment extend into a repeat?
>>
>> Best wishes, and many thanks in advance,
>> Michiel de Hoon
>> RIKEN Omics Science Center
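
The block coordinates in the chain excerpt quoted above can be checked with a
short script. The following is a minimal Python sketch, not part of the UCSC
tools: the names target_blocks and overlapping_blocks are made up for
illustration. It assumes that the target strand is + (as in the quoted
header), that each block line is "size dt dq" as described in the UCSC chain
format, and that the repeat coordinates chr7:87564770..87565324 are 1-based
inclusive, as in RepeatMasker .out files.

```python
# Minimal sketch (hypothetical helper names): walk the ungapped blocks of a
# chain and report which ones overlap an interval on the target sequence.

# Block lines copied from the quoted chain excerpt: (size, dt, dq).
BLOCKS = [
    (17, 2, 0), (57, 0, 1), (19, 0, 1), (7, 7, 0), (139, 6, 0),
    (15, 1, 0), (12, 7, 0), (4, 3, 0), (11, 0, 1), (57, 1, 0),
    (52, 0, 1), (23, 6, 0), (61, 0, 1), (51, 2, 0), (71, 1, 0),
]
T_START = 87564296  # tStart field of the quoted chain header (0-based)

# Repeat from the message, ASSUMED to be 1-based inclusive.
REP_START = 87564770 - 1  # converted to a 0-based start
REP_END = 87565324        # half-open end


def target_blocks(t_start, blocks):
    """Yield (start, end) target coordinates, 0-based half-open, per block.

    size is the ungapped block length; dt is the gap to the next block on
    the target. dq (the gap on the query) is not needed for target positions.
    """
    pos = t_start
    for size, dt, _dq in blocks:
        yield pos, pos + size
        pos += size + dt


def overlapping_blocks(t_start, blocks, rep_start, rep_end):
    """Return (block_number, start, end) for blocks overlapping the interval.

    block_number is 1-based, counted from the first block of the chain.
    """
    hits = []
    for number, (start, end) in enumerate(target_blocks(t_start, blocks), 1):
        if start < rep_end and end > rep_start:
            hits.append((number, start, end))
    return hits


if __name__ == "__main__":
    for number, start, end in overlapping_blocks(
        T_START, BLOCKS, REP_START, REP_END
    ):
        print(f"block {number}: chr7:{start}-{end} overlaps the repeat")
```

Under these assumptions, the first block that reaches into the repeat is the
13th one (size 61), covering target positions 87564742 to 87564803 in 0-based
half-open coordinates. This matches the block flagged in the excerpt; the two
blocks that follow it in the excerpt lie inside the repeat as well.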

_______________________________________________
Genome maillist  -  genome at soe.ucsc.edu
https://lists.soe.ucsc.edu/mailman/listinfo/genome
