Hi Aaron,

Thank you for the quick and spot-on reply.

>> I recently started using progressiveMauve to align large eukaryotic genomes 
>> and ran into some problems: 
>> 
>> 1) the studied genomes are repeat masked (i.e. contain long stretches of 
>> Ns). When extracting homologous segments of the input genomes from the 
>> backbone file I found that some are located in masked regions. Is there a 
>> way to prevent Mauve from using masked regions in identifying homologous 
>> segments? As far as I am aware, no such parameter exists for the 
>> incorporated muscle sequence aligner. 
> 
> there is currently no good way to prevent this behavior. It is likely
> happening because the flanking regions were identified as positionally
> homologous and so used as alignment anchors, and the masked region
> between them in the two genomes became aligned because they were between
> anchors. This happens because the N are internally converted to A by the
> aligner when it stores the sequence in a 2-bit-per-base encoding. 
> 
> It might be possible to modify the homology HMM to include N as a
> possible emission with probabilities reflecting a mixture of A,C,G,T and
> so adjust the posterior probability of homology accordingly. This would
> require some tinkering with the code.
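
To check that I understand the 2-bit explanation, here is a minimal toy sketch. It is not Mauve's actual code: the A=0, C=1, G=2, T=3 mapping and the function names are my own assumptions, chosen only to show why a masked stretch ends up looking like a run of A once only four symbols can be stored.

```python
# Toy 2-bit-per-base encoding. The code table below is an assumption for
# illustration and is not taken from the Mauve source.
TWO_BIT = {"A": 0, "C": 1, "G": 2, "T": 3}
DECODE = "ACGT"


def encode(seq):
    """Encode a DNA string as a list of 2-bit codes (integers 0..3).

    Two bits leave no room for a fifth symbol, so anything that is not
    A/C/G/T (for example a masked N) falls back to code 0, i.e. A.
    """
    return [TWO_BIT.get(base, 0) for base in seq.upper()]


def decode(codes):
    """Turn a list of 2-bit codes back into a DNA string."""
    return "".join(DECODE[code] for code in codes)


if __name__ == "__main__":
    masked = "ACNNNNGT"
    print(masked, "->", decode(encode(masked)))
```

After such a round trip, two masked stretches from different genomes are indistinguishable from two identical poly-A runs, which would be consistent with them being aligned between flanking anchors.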

Would you recommend partitioning each input sequence into multiple records of a 
multi-fasta file so as to omit large masked regions? As I understand from the 
manual and from previous posts on this mailing list, progressiveMauve 
concatenates all records of a multi-fasta input file. I guess my question 
is: when constructing the backbone, does it prevent homologous segments from 
spanning more than one record of a multi-fasta file?
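
To make the partitioning idea concrete, this is roughly what I have in mind. It is only a sketch: the function names, the record naming scheme (original name plus 1-based coordinates) and the min_run threshold are my own choices, and it says nothing about how progressiveMauve treats record boundaries, which is exactly my question.

```python
import re


def split_at_masked_runs(name, sequence, min_run=1000):
    """Yield (record_id, subsequence) for the stretches between masked runs.

    A masked run is min_run or more consecutive N characters. Shorter N runs
    are left inside the records. The record id keeps the 1-based coordinates
    of the piece in the original sequence (a hypothetical naming scheme).
    """
    pattern = re.compile(r"[Nn]{%d,}" % min_run)
    start = 0
    for match in pattern.finditer(sequence):
        if match.start() > start:
            yield (f"{name}:{start + 1}-{match.start()}",
                   sequence[start:match.start()])
        start = match.end()
    if start < len(sequence):
        yield (f"{name}:{start + 1}-{len(sequence)}", sequence[start:])


def to_fasta(records, width=60):
    """Format (record_id, sequence) pairs as multi-fasta text."""
    lines = []
    for record_id, seq in records:
        lines.append(">" + record_id)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    toy = "ACGT" + "N" * 5 + "GG" + "NN" + "TT"
    print(to_fasta(split_at_masked_runs("chr1", toy, min_run=3)), end="")
```

Keeping the original coordinates in the record names should make it possible to map positions back to the unsplit genome afterwards.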

>> 2) I observe sometimes strange lines in the backbone file such as the 
>> following:
>> ___
>> 7691835 7691966 -85715547 -85715547 0 0 0 0 349474437 349474583 -700243823 -700243822 0 0
>> 8282300 8282275 0 0 0 0 0 0 0 0 0 0 0 0
>> ___
>> 
>> Note that in the first line, the segments specified by columns [3,4] and 
>> [11, 12] have lengths 0 and -1, respectively. Negative lengths mostly occur 
>> for segments that are not homologous to segments in other genomes, as shown 
>> in the second line (which makes me wonder why they are included in the 
>> backbone file in the first place).
> 
> I've not seen this before but yes it does seem like a bug! As a
> workaround, is it possible to ignore these segments in your downstream
> processing until I can get a fix?

Yes, I currently identify and discard these homologous blocks when processing 
the backbone file. I would like to note that these “strange lines” occur 
extremely rarely in my dataset: only 89 of the 1693288 lines in the backbone 
file contain entries of negative segment length. 
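
For reference, the filter is essentially the following sketch. It assumes that each backbone line holds one (left end, right end) pair per genome, that a negative sign marks the reverse strand, and that a 0 0 pair means the genome is absent from the block. I take the segment length as |right| - |left|, which reproduces the lengths 0 and -1 quoted above. The function names are mine.

```python
def segment_lengths(fields):
    """Return |right| - |left| per genome, or None where the pair is 0 0."""
    values = [int(field) for field in fields]
    lengths = []
    for left, right in zip(values[0::2], values[1::2]):
        if left == 0 and right == 0:
            lengths.append(None)  # genome not part of this block
        else:
            lengths.append(abs(right) - abs(left))
    return lengths


def has_negative_length(fields):
    """True if any genome's segment on this line has negative length."""
    return any(length is not None and length < 0
               for length in segment_lengths(fields))


def filter_backbone(lines):
    """Split backbone lines into (kept, dropped) lists.

    Blank lines are skipped. Lines whose first field is not an integer
    (such as a column header) are kept unchanged.
    """
    kept, dropped = [], []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if not fields[0].lstrip("-").isdigit():
            kept.append(line)
        elif has_negative_length(fields):
            dropped.append(line)
        else:
            kept.append(line)
    return kept, dropped
```

Applied to the two example lines above, both are dropped: the first because of the pair in columns [11, 12], the second because of the pair in columns [1, 2].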

Thank you again for your very helpful reply,
Daniel
_______________________________________________
Mauve-users mailing list
Mauve-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/mauve-users
