Re: [Jalview-discuss] Align general sequences

Jim Procter Wed, 30 May 2012 07:43:53 -0700

Hello there Karen.

On 28/05/2012 20:14, Karen Keight wrote:
> Hi! I've just discovered Jalview, I have a social sciences background, 
> I'm studying life trajectories as sequences of states.
OK. Sounds interesting!
> I'm wondering if I could apply Jalview for aligning sequences of  a 
> fixed number of  symbols (for example 50) each
> one representing a life state of one person (A= Birth, B= Start 
> elementary school, C= Start Ballet, D= Get married, E= First child,
> F= Get divorced, G= Second marriage, ..., &= Quit first job, *= Night 
> courses,.. etc).
> So each sequence is one person's life and I have for example 3000 persons.
>
> Could I apply the tree-based grouping of sequences functionality (e.g. 
> Neighbour joining or Average distance)  to get similar life 
> trajectories using Jalview?
> I hope so!!!
You certainly could, in principle. Neighbour joining and average 
distance are both generic tree algorithms that work on a matrix of 
distances, and you could even use Jalview's PCA function, which gives 
a complementary view of how the sequences group together. You would 
first need to align your sequences (unless they are simply based on 
time points), and you'd also need to be careful about how you encode 
life states if you used Jalview's built-in distance matrix calculations.
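
To make the "matrix of distances" idea concrete, here is a minimal Python 
sketch. It is not Jalview's code: the distance measure (fraction of 
mismatched positions), the function names and the four example 
trajectories are all assumptions chosen for illustration, and it expects 
sequences that are already aligned to the same length.

```python
from itertools import combinations

def hamming_distance(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def average_linkage(seqs):
    """Average-distance agglomerative clustering.

    Returns the merges in the order they happen, each as
    (members_of_cluster_1, members_of_cluster_2, average_distance),
    where members are indices into seqs.
    """
    # Pairwise distances between the original sequences.
    dist = {frozenset((i, j)): hamming_distance(seqs[i], seqs[j])
            for i, j in combinations(range(len(seqs)), 2)}
    clusters = [(i,) for i in range(len(seqs))]
    merges = []

    def cluster_distance(c1, c2):
        # Mean distance over every cross-cluster pair of sequences.
        total = sum(dist[frozenset((i, j))] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Merge the two clusters that are closest on average.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        merges.append((c1, c2, cluster_distance(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)]
        clusters.append(tuple(sorted(c1 + c2)))
    return merges

# Hypothetical, already-aligned life trajectories (one symbol per time point).
lives = ["ABCDE", "ABCDF", "AB&*G", "AB&*E"]
for left, right, d in average_linkage(lives):
    print(left, right, round(d, 3))
```

The order of merges is the tree: trajectories joined early are the most 
similar ones. For 3000 people this naive loop would be slow, so a proper 
clustering library would be the practical choice.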


Most bioinformatics analyses encode molecular sequences using a standard 
alphabet, and Jalview applies a filter before analysing the sequences to 
ensure the analysis doesn't fail because of an unknown symbol. Even 
worse, if you use the BLOSUM62 score model for tree building, then 
different symbol matches are scored in different ways because some 
mutations are more likely than others (kind of like someone starting 
modern dance rather than ballet, or some other pre-teen activity).
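
As a rough illustration of what a score model does, the Python sketch 
below scores two aligned sequences position by position. The numbers, 
the symbol M (standing for "Start modern dance") and the choice to treat 
C and M as a "near match" are all invented for this example; they are 
not taken from BLOSUM62 or from Jalview.

```python
# Hypothetical scores: identical states score highest, states judged
# to be similar score a little, and everything else is penalised.
MATCH = 4
MISMATCH = -1
SIMILAR = {
    frozenset("CM"): 2,  # assumed: ballet (C) vs modern dance (M)
}

def pair_score(x, y):
    """Score for one aligned pair of symbols."""
    if x == y:
        return MATCH
    return SIMILAR.get(frozenset((x, y)), MISMATCH)

def similarity(a, b):
    """Sum of per-position scores for two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(pair_score(x, y) for x, y in zip(a, b))

print(similarity("ABCD", "ABMD"))  # C vs M counts as a near match
print(similarity("ABCD", "ABGD"))  # C vs G counts as a plain mismatch
```

With a protein score model applied to life-state symbols, the equivalent 
of the SIMILAR table would reflect amino acid chemistry rather than 
anything meaningful about people's lives.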

It wouldn't be too hard to adapt Jalview's filters to support a wider 
range of states, or to use a different similarity model, and in fact it's 
something I'd like it to be able to do in the future. If you, or anyone 
you know, is a Java programmer, then I'd happily point out the parts that 
need to be modified for your needs (the colour schemes wouldn't work for 
your symbols either, if you have more than 20).

An alternative, which is a bit more technical but ultimately more 
rigorous, would be to use one of the multiple alignment tools that have 
no built-in rules about sequence symbols. One or two of the programs 
Jalview uses can align generic sequences, and the Notredame group, which 
produced one of the most accurate alignment programs, developed a tool 
called SALTT (http://www.tcoffee.org/saltt/). This supports the analysis 
of generic symbol sequences, and might be just what you need.
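
To show that alignment itself does not depend on a biological alphabet, 
here is a small Python sketch of textbook global pairwise alignment 
(Needleman-Wunsch style). It is not how SALTT or any of the programs 
above work internally; the scores are arbitrary assumptions, and it 
assumes "-" is free to use as the gap character.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two symbol sequences; returns (a_aln, b_aln, score)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]

# Two hypothetical lives; the second person never started ballet (C).
aligned_a, aligned_b, total = global_align("ABCDE", "ABDE")
print(aligned_a)
print(aligned_b)
print(total)
```

Aligning thousands of sequences at once is a much harder problem than 
this pairwise case, which is why a dedicated multiple alignment tool is 
worth using.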
> Thanks in advance and my apologies for this naive question..
You're welcome - and the question is not naive at all! The underlying 
algorithms and mathematical problems used in molecular sequence analysis 
are exactly what you might use for your life sequence analysis problem. 
The only differences are in how the symbols are interpreted, and what 
assumptions about the data have been made. The techniques most relevant 
to your analysis are hidden Markov models, which were originally 
developed for time series analysis, where some hidden process generates 
a sequence of observable states. One wants to model the hidden 
process in order, for example, to decide whether one hidden process 
is similar to the process that generated another series of states.
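
For a flavour of how a hidden Markov model is used, the Python sketch 
below computes the probability of an observed sequence with the standard 
forward algorithm. The two hidden states ("settled" and "changing") and 
every probability in it are made up for illustration; a real model would 
be estimated from data.

```python
def forward_likelihood(obs, start, trans, emit):
    """Probability of obs under the model, summed over all hidden paths."""
    states = list(start)
    # alpha[s] = P(symbols seen so far, and the hidden state is now s)
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {s: emit[s][symbol] * sum(alpha[r] * trans[r][s] for r in states)
                 for s in states}
    return sum(alpha.values())

# Hypothetical model: a hidden "life phase" emits observable life states
# D (get married), E (first child) and F (get divorced).
start = {"settled": 0.6, "changing": 0.4}
trans = {"settled": {"settled": 0.8, "changing": 0.2},
         "changing": {"settled": 0.5, "changing": 0.5}}
emit = {"settled": {"D": 0.5, "E": 0.4, "F": 0.1},
        "changing": {"D": 0.2, "E": 0.1, "F": 0.7}}

print(forward_likelihood("DEF", start, trans, emit))
```

Scoring the same trajectory under two different models and comparing the 
likelihoods is one way of asking which hidden process is more likely to 
have generated it.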

I hope that helps - there are piles of literature on the subject, and 
I'm sure that others on the list might be able to suggest some fruitful 
directions.
Jim.
