Hi Mauve users,
I routinely use progressiveMauve via the command-line Linux x64 binary 
(snapshot 2015-02-13) to align closed bacterial genomes from many strains of 
the same species. These genomes differ very little in sequence (SNPs) and 
content (genes), but often have very heterogeneous structures due to 
rearrangements. I am attempting to characterize structural/rearrangement 
differences conserved among sub-sets of genomes, which are otherwise not 
collinear, by parsing the .backbone file. However, I frequently observe that 
backbone alignments of 'many' genomes (n>=50) are not consistent with 
alignments of 'few' genomes (n=5). That is to say, when inspecting LCBs or 
aligned sequence backbone blocks, the larger alignments often fail to match 
homologous stretches shared by a few genomes, while these same loci are matched 
in a smaller alignment. Further, my inputs are annotated GenBank files, and the 
genes in these unmatched blocks appear to be orthologs by manual inspection. 
In essence, the larger alignment 'over-calls' the number of LCBs shared by the 
set of genomes.

These unmatched sequences are usually observable in the backbone file (and the 
xmfa) as additional strain-specific blocks (or LCBs), often of the same 
approximate length, like in this simplified example:

seq0_leftend    seq0_rightend   seq1_leftend    seq1_rightend
2123820         2129873         0               0
0               0               2123981         2129885
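
As an illustration of the pattern above, here is a minimal Python sketch (not part of Mauve; all function names and the 5% length tolerance are hypothetical) that parses backbone text, finds blocks present in only one genome, and pairs up those from different genomes that have similar lengths. It assumes the usual layout of one left/right column pair per genome, with 0 0 meaning the genome is absent from that block:

```python
def parse_backbone(text):
    """Parse .backbone text into rows of (left, right) tuples, one per genome.

    (0, 0) marks a genome that is absent from the block. Negative
    coordinates (reverse strand) are kept as they appear.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("seq"):
            continue  # skip blank lines and the header row
        values = [int(v) for v in fields]
        rows.append([(values[i], values[i + 1]) for i in range(0, len(values), 2)])
    return rows


def singleton_blocks(rows):
    """Return (genome_index, length) for blocks present in exactly one genome."""
    singles = []
    for row in rows:
        present = [(g, l, r) for g, (l, r) in enumerate(row) if (l, r) != (0, 0)]
        if len(present) == 1:
            g, left, right = present[0]
            singles.append((g, abs(abs(right) - abs(left)) + 1))
    return singles


def candidate_pairs(singles, tolerance=0.05):
    """Pair singleton blocks from different genomes whose lengths are similar.

    'tolerance' is an arbitrary illustrative cutoff (fraction of the longer block).
    """
    pairs = []
    for i, (g1, len1) in enumerate(singles):
        for g2, len2 in singles[i + 1:]:
            if g1 != g2 and abs(len1 - len2) <= tolerance * max(len1, len2):
                pairs.append(((g1, len1), (g2, len2)))
    return pairs


# The simplified two-genome example from above.
example = (
    "seq0_leftend\tseq0_rightend\tseq1_leftend\tseq1_rightend\n"
    "2123820\t2129873\t0\t0\n"
    "0\t0\t2123981\t2129885\n"
)
rows = parse_backbone(example)
singles = singleton_blocks(rows)
pairs = candidate_pairs(singles)
print(pairs)
```

Similar length alone does not establish homology, so pairs flagged this way are only candidates for the kind of manual ortholog inspection described earlier.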


My approach to combat this has been to test many values for the parameters 
'--seed-weight' and '--hmm-identity', and to select as optimal the combination 
that produces the .backbone file with the fewest lines, i.e. sequence 
blocks. The logic is that fewer unique blocks should mean fewer unmatched 
alignments between loci of 'true' homology shared between the genomes. This 
parameter optimization strategy has been successful, but only to a point, and 
analysis of the output still trips over apparent artifacts when I attempt to 
manually validate reported break-points with an independent alignment of 4-5 
representative genomes.
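
For concreteness, the selection step can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the option spellings (--seed-weight=, --hmm-identity=, --output=, --backbone-output=) should be checked against progressiveMauve --help, and the parameter values, input file names, output directory, and function names are all hypothetical. The sketch only builds the command lines and compares finished backbone files; it does not launch any alignments itself.

```python
import itertools
from pathlib import Path


def build_command(inputs, seed_weight, hmm_identity, out_dir):
    """Build one progressiveMauve command line for a parameter combination.

    Option names are assumptions; verify them with progressiveMauve --help.
    """
    tag = f"sw{seed_weight}_hmm{hmm_identity}"
    out_dir = Path(out_dir)
    return [
        "progressiveMauve",
        f"--seed-weight={seed_weight}",
        f"--hmm-identity={hmm_identity}",
        f"--output={out_dir / (tag + '.xmfa')}",
        f"--backbone-output={out_dir / (tag + '.backbone')}",
        *inputs,
    ]


def count_backbone_rows(path):
    """Count data rows (sequence blocks) in a backbone file, skipping the header."""
    with open(path) as handle:
        return sum(
            1 for line in handle if line.strip() and not line.startswith("seq")
        )


def fewest_rows(backbone_paths):
    """Return the backbone file with the fewest blocks."""
    return min(backbone_paths, key=count_backbone_rows)


# Hypothetical grid and inputs; each command could be passed to subprocess.run.
seed_weights = (11, 15, 19)
hmm_identities = (0.7, 0.85)
commands = [
    build_command(["genomeA.gbk", "genomeB.gbk"], sw, hmm, "sweep")
    for sw, hmm in itertools.product(seed_weights, hmm_identities)
]
print(len(commands), "parameter combinations")
```

Once the runs finish, something like fewest_rows(sorted(Path("sweep").glob("*.backbone"))) would pick the combination with the fewest blocks. Note that a low block count is only a proxy: it does not by itself show that the merged blocks are correctly aligned.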

Have other users experienced similar issues with alignment 
reproducibility when scaling? If so, what approaches or parameters have been 
successful in making alignments of 'many' consistent with alignments of 'few'? 
Given that multiple alignments are 'built' from many pairwise alignments, I 
expect differences between n=2 and n=many alignments. Perhaps inputting a guide 
tree to control the order in which genomes should be added to the multiple 
alignment would help, particularly by first aligning genomes with the fewest 
rearrangement differences?

I appreciate your suggestions and advice. Thanks.

Mike


====================
Michael R. Weigand, PhD
Bioinformatician | IHRC
NCIRD/DBD/MVPDB
Centers for Disease Control and Prevention
mweig...@cdc.gov
404.639.2473


