Hi Mauve users,
I routinely use progressiveMauve via the command-line Linux x64 binary
(snapshot 2015-02-13) to align closed bacterial genomes from many strains of
the same species. These genomes differ very little in sequence (SNPs) and
content (genes), but often have very heterogeneous structures due to
rearrangements. I am attempting to characterize structural/rearrangement
differences conserved among sub-sets of genomes, which are otherwise not
collinear, by parsing the .backbone file. However, I frequently observe that
backbone alignments of 'many' genomes (n>=50) are not consistent with
alignments of 'few' genomes (n=5). That is to say, when inspecting LCBs or
aligned sequence backbone blocks, the larger alignments often fail to match
homologous stretches shared by a few genomes, while these same loci are matched
in a smaller alignment. Further, my input is annotated GenBank files, and the
genes in these unmatched blocks appear to be orthologs by manual inspection.
In essence, the larger alignment 'over-calls' the number of LCBs shared by the
set of genomes.
These unmatched sequences are usually observable in the backbone file (and the
xmfa) as additional strain-specific blocks (or LCBs), often of the same
approximate length, like in this simplified example:
seq0_leftend  seq0_rightend  seq1_leftend  seq1_rightend
2123820       2129873        0             0
0             0              2123981       2129885
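To make the pattern above concrete, here is a rough Python sketch of how such suspect blocks could be flagged in a .backbone file. It assumes the usual whitespace-delimited layout with one left/right coordinate pair per genome, a header line starting with "seq", and 0 0 marking a genome that is absent from a block. The function names and the 5% length tolerance are purely illustrative, not part of Mauve.

```python
import io
from itertools import combinations


def read_backbone(handle):
    """Return one list of (left, right) tuples per backbone row, one tuple per genome."""
    rows = []
    for line in handle:
        fields = line.split()
        if not fields or fields[0].startswith("seq"):
            continue  # skip blank lines and the header line
        values = [int(x) for x in fields]
        rows.append(list(zip(values[0::2], values[1::2])))
    return rows


def singletons(rows):
    """Yield (genome_index, left, right) for blocks present in exactly one genome."""
    for row in rows:
        present = [(i, l, r) for i, (l, r) in enumerate(row) if (l, r) != (0, 0)]
        if len(present) == 1:
            yield present[0]


def block_length(left, right):
    """Block length in bp; abs() because reverse-strand coordinates are negative."""
    return abs(abs(right) - abs(left)) + 1


def candidate_pairs(rows, length_tol=0.05):
    """Pair single-genome blocks from different genomes with similar lengths.

    length_tol is a hypothetical threshold: the maximum length difference,
    as a fraction of the longer block.
    """
    pairs = []
    for a, b in combinations(list(singletons(rows)), 2):
        if a[0] == b[0]:
            continue  # same genome, not a candidate for a missed match
        len_a, len_b = block_length(a[1], a[2]), block_length(b[1], b[2])
        if abs(len_a - len_b) <= length_tol * max(len_a, len_b):
            pairs.append((a, b))
    return pairs


example = (
    "seq0_leftend\tseq0_rightend\tseq1_leftend\tseq1_rightend\n"
    "2123820\t2129873\t0\t0\n"
    "0\t0\t2123981\t2129885\n"
)
rows = read_backbone(io.StringIO(example))
pairs = candidate_pairs(rows)
```

Similar length alone does not establish homology, so pairs flagged this way are only candidates for follow-up, e.g. by comparing the annotated genes in each block.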
My approach to combat this has been to test many values for the parameters
'--seed-weight' and '--hmm-identity', and select as the optimal combination the
one that produces the .backbone file with the fewest lines, i.e., sequence
blocks. The logic is that fewer unique blocks should mean fewer unmatched
alignments between loci of 'true' homology shared between the genomes. This
parameter optimization strategy has been successful, but only to a point, and
analysis of the output still trips over apparent artifacts when I attempt to
manually validate reported break-points with an independent alignment of 4-5
representative genomes.
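For reference, the parameter sweep described above could be scripted roughly as follows. This is only a sketch: it assumes progressiveMauve is on the PATH and uses its --output and --backbone-output options alongside the two parameters being tuned. The function names, the output file naming, and the parameter values in the usage note are illustrative.

```python
import subprocess
from itertools import product
from pathlib import Path


def mauve_command(genomes, seed_weight, hmm_identity, out_dir):
    """Build a progressiveMauve command line for one parameter combination."""
    tag = f"sw{seed_weight}_hmm{hmm_identity}"  # illustrative naming scheme
    out_dir = Path(out_dir)
    return [
        "progressiveMauve",
        f"--seed-weight={seed_weight}",
        f"--hmm-identity={hmm_identity}",
        f"--output={out_dir / (tag + '.xmfa')}",
        f"--backbone-output={out_dir / (tag + '.backbone')}",
        *genomes,
    ]


def count_blocks(backbone_path):
    """Count backbone rows, skipping blank lines and the 'seq...' header."""
    with open(backbone_path) as handle:
        return sum(1 for line in handle
                   if line.strip() and not line.startswith("seq"))


def sweep(genomes, seed_weights, hmm_identities, out_dir, run=subprocess.run):
    """Run every combination and return (best_combination, all_counts).

    'Best' here means the fewest backbone rows, as described in the text.
    The run argument exists only so the aligner call can be swapped out.
    """
    counts = {}
    for sw, hmm in product(seed_weights, hmm_identities):
        cmd = mauve_command(genomes, sw, hmm, out_dir)
        run(cmd, check=True)
        backbone = next(arg.split("=", 1)[1] for arg in cmd
                        if arg.startswith("--backbone-output="))
        counts[(sw, hmm)] = count_blocks(backbone)
    best = min(counts, key=counts.get)
    return best, counts
```

A call might look like sweep(["a.gbk", "b.gbk"], [13, 15, 17], [0.7, 0.8, 0.9], "sweep_out"), where the file names and values are placeholders and the output directory must already exist.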
So I'm curious if other users have experienced similar issues of alignment
reproducibility when scaling? If so, what approaches or parameters have been
successful in making alignments of 'many' consistent with alignments of 'few'?
Given that multiple alignments are 'built' from many pairwise alignments, I
expect differences between n=2 and n=many alignments. Perhaps inputting a guide
tree to control the order in which genomes should be added to the multiple
alignment would help, particularly by first aligning genomes with the fewest
rearrangement differences?
I appreciate your suggestions and advice. Thanks.
Mike
====================
Michael R. Weigand, PhD
Bioinformatician | IHRC
NCIRD/DBD/MVPDB
Centers for Disease Control and Prevention
mweig...@cdc.gov
404.639.2473
_______________________________________________
Mauve-users mailing list
Mauve-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/mauve-users