Hello there Karen.

On 28/05/2012 20:14, Karen Keight wrote:
> Hi! I've just discovered Jalview, I have a social sciences background,
> I'm studying life trajectories as sequences of states.

OK. Sounds interesting!

> I'm wondering if I could apply Jalview for aligning sequences of a
> fixed number of symbols (for example 50) each
> one representing a life state of one person (A= Birth, B= Start
> elementary school, C= Start Ballet, D= Get married, E= First child,
> F= Get divorced, G= Second marriage, ..., &= Quit first job, *= Night
> courses,.. etc).
> So each sequence is one person's life and I have for example 3000 persons.
>
> Could I apply the tree-based grouping of sequences functionality (e.g.
> Neighbour joining or Average distance) to get similar life
> trajectories using Jalview?
> I hope so!!!

You certainly could, in principle. Neighbour joining and average distance are both generic tree algorithms that work on a matrix of distances, and you could even use Jalview's PCA function, which is another way to explore how the sequences cluster. However, you would first need to align your sequences (if they are not simply based on time points), and you'd also need to be careful about how you encode life states if you used Jalview's built-in distance matrix calculations.
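To make the "matrix of distances" idea concrete, here is a minimal sketch in Python. It is not Jalview's implementation: it assumes the trajectories are already aligned to the same length, uses a plain fraction-of-mismatches distance, and builds a tree by average-distance merging. The names (`mismatch_distance`, `average_linkage_tree`, `lives`) and the four toy trajectories are made up for illustration.

```python
from itertools import combinations


def mismatch_distance(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def average_linkage_tree(seqs):
    """Merge the two closest clusters repeatedly; return a nested tuple tree."""
    names = list(seqs)
    # Pairwise distance matrix, stored once per unordered pair of indices.
    dist = {
        (i, j): mismatch_distance(seqs[names[i]], seqs[names[j]])
        for i, j in combinations(range(len(names)), 2)
    }

    def cluster_distance(c1, c2):
        # Average of all pairwise distances between members of the clusters.
        pairs = [(min(i, j), max(i, j)) for i in c1[1] for j in c2[1]]
        return sum(dist[p] for p in pairs) / len(pairs)

    # Each cluster is (subtree, list of member indices).
    clusters = [(name, [i]) for i, name in enumerate(names)]
    while len(clusters) > 1:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]),
        )
        merged = ((clusters[a][0], clusters[b][0]), clusters[a][1] + clusters[b][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters[0][0]


# Toy data (hypothetical): four short "lives" using the symbols from the question.
lives = {"p1": "ABCDE", "p2": "ABCDF", "p3": "AB&*G", "p4": "AB&*E"}
tree = average_linkage_tree(lives)
print(tree)
```

With 3000 people this naive version would be slow, since it recomputes cluster averages on every merge; it is only meant to show that nothing in the tree-building step depends on the symbols being amino acids.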
Most bioinformatics analyses encode molecular sequences using a standard alphabet, and Jalview applies a filter before analysing the sequences to ensure the analysis doesn't fail because of an unknown symbol, so symbols outside that alphabet may not be treated as you intend. Even worse, if you use the BLOSUM62 score model for tree building, then different symbol matches are scored in different ways because some mutations are more likely than others (kind of like someone starting modern dance rather than ballet, or some other pre-teen activity).

It wouldn't be too hard to adapt Jalview's filters to support a wider range of states, or to use a different similarity model, and in fact, it's something I'd like it to be able to do in the future. If you, or anyone you know, is a Java programmer, then I'd happily point out the parts that need to be modified for your needs (the colour schemes wouldn't work for your symbols either, if you have more than 20).

Another alternative, which is a bit more technical, but ultimately more rigorous, would be to use one of the multiple alignment tools that have no built-in rules about sequence symbols. One or two of the programs Jalview uses can align generic sequences, and the Notredame group, which produced one of the most accurate alignment programs, developed a tool called SALTT (http://www.tcoffee.org/saltt/). This supports the analysis of generic symbol sequences, and might be just what you need.

> Thanks in advance and my apologies for this naive question..

You're welcome - and the question is not naive at all! The underlying algorithms and mathematical problems used in molecular sequence analysis are exactly what you might use for your life sequence analysis problem. The only differences are in how the symbols are interpreted, and what assumptions about the data have been made.
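As a rough illustration of what "how the symbols are interpreted" means in practice, the sketch below pairs a symbol filter with a small similarity model, in the spirit of the filter and score model discussed above. It is not Jalview code. The alphabet, the extra symbol `M` (standing for "start modern dance") and the 0.5 partial score are all assumptions invented for this example.

```python
# Hypothetical alphabet: the states from the question plus 'M' (start modern dance).
ALPHABET = set("ABCDEFG&*M")

# Partial credit for distinct but related states; the value 0.5 is made up.
RELATED = {frozenset("CM"): 0.5}  # ballet vs. modern dance


def check_symbols(seq, alphabet=ALPHABET):
    """Reject sequences containing symbols the model does not know about."""
    unknown = sorted(set(seq) - alphabet)
    if unknown:
        raise ValueError(f"unknown symbols: {unknown}")
    return seq


def similarity(x, y):
    """1.0 for identical states, partial credit for related ones, else 0.0."""
    if x == y:
        return 1.0
    return RELATED.get(frozenset((x, y)), 0.0)


def score_distance(a, b):
    """Distance between two aligned sequences under the similarity model."""
    check_symbols(a)
    check_symbols(b)
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return 1.0 - sum(similarity(x, y) for x, y in zip(a, b)) / len(a)


# Ballet vs. modern dance counts as a near match; ballet vs. first child does not.
print(score_distance("ABCD", "ABMD"))
print(score_distance("ABCD", "ABED"))
```

A distance like this could be fed into any generic tree or clustering method; choosing the partial scores is the part that needs real domain knowledge.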
The techniques most relevant to your analysis are hidden Markov models, which were originally developed for time series analysis, where some hidden process generates a sequence of observable states, and one wants to model the hidden process in order, for example, to decide whether one hidden process is similar to the process that generated another series of states.

I hope that helps - there are piles of literature on the subject, and I'm sure that others on the list might be able to suggest some fruitful directions.

Jim.
