Aha! I was just about to answer this as well. To add to Jim's comments: as a starting point, you could just try reading your sequences into Jalview and seeing what it makes of them. You'll need to put your strings in a text file in one of the formats supported by Jalview. I suggest FASTA, as it is pretty simple. Although Jalview is meant for sequences with at most a 20-letter alphabet, it will not complain about the non-standard letters, and if you use the standard colouring schemes and the PID (percentage identity) tree calculation you may get something that helps you interpret your strings. I just tried this with some dummy data and the tree made sense even though it was ignoring some of the characters.
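
In case it helps, here is a minimal sketch of that FASTA step in Python. The language is my choice, and the names write_fasta and percent_identity, the file name trajectories.fasta and the three example strings are all made up for illustration. It checks that the strings are the same length, writes one FASTA record per person, and includes a plain position-by-position percentage identity so you can check a pair by hand. That function only illustrates the idea; it is not Jalview's own PID calculation, and I have not checked how Jalview treats symbols such as & or *.

```python
from pathlib import Path


def write_fasta(sequences, path):
    """Write a {name: string} mapping as FASTA, one record per entry."""
    lengths = {len(s) for s in sequences.values()}
    if len(lengths) > 1:
        # Position-by-position comparison only makes sense for equal lengths.
        raise ValueError(f"strings differ in length: {sorted(lengths)}")
    lines = []
    for name, seq in sequences.items():
        lines.append(f">{name}")  # header line: '>' followed by an identifier
        lines.append(seq)         # the string itself, on a single line
    Path(path).write_text("\n".join(lines) + "\n")


def percent_identity(a, b):
    """Percentage of positions at which two equal-length strings agree."""
    if len(a) != len(b) or not a:
        raise ValueError("strings must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# Made-up trajectories, purely for illustration.
people = {
    "person_1": "ABCDE",
    "person_2": "ABCDF",
    "person_3": "ABGDF",
}
write_fasta(people, "trajectories.fasta")
```

The resulting trajectories.fasta is the kind of file you would then try opening in Jalview. With real data you would build the dictionary from wherever your 3000 strings are stored.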

This is all predicated on:

1. Your strings are all the same length.
2. You do not need to align them optimally, i.e. position 1 always aligns with 1, 2 with 2 and so on.

If you need to align them, then see Jim's comments on tools that can cope with non-protein/DNA/RNA sequences. Likewise, if you want to get serious about the interpretation of the trees, you will need to recode to make sure your scoring metric for comparing strings is sensible given the data.

If you don't need to do alignment (i.e. put in insertions/deletions) then there are 101 ways of comparing the strings, scoring the comparison, and plotting the results as trees, PCA plots or networks. The statistics package "R" would be a useful tool for this.

As an aside, about 15 years ago someone asked me a similar question: they were working on the pre-Incan Quipu knot system and wanted to compare Quipu to each other by multiple alignment. In the end I think they used an adapted version of Clustal for this, since in their case it was clear the Quipu would require insertions/deletions to find common regions.

Have fun!

Geoff.

On 30/05/2012 15:43, Jim Procter wrote:
> Hello there Karen.
>
> On 28/05/2012 20:14, Karen Keight wrote:
>> Hi! I've just discovered Jalview, I have a social sciences background,
>> I'm studying life trajectories as sequences of states.
> OK. Sounds interesting!
>> I'm wondering if I could apply Jalview for aligning sequences of a
>> fixed number of symbols (for example 50) each
>> one representing a life state of one person (A= Birth, B= Start
>> elementary school, C= Start Ballet, D= Get married, E= First child,
>> F= Get divorced, G= Second marriage, ..., &= Quit first job, *= Night
>> courses, etc.).
>> So each sequence is one person's life and I have for example 3000 persons.
>>
>> Could I apply the tree-based grouping of sequences functionality (e.g.
>> Neighbour joining or Average distance) to get similar life
>> trajectories using Jalview?
>> I hope so!!!
> You certainly could in principle. Neighbour joining and average
> distance are both generic tree algorithms that work on a matrix of
> distances, and you could even use Jalview's PCA function, which is
> another kind of cluster analysis, but you would first need to align your
> sequences (if they are not simply based on time points), and you'd also
> need to be careful about how you encode life states if you used
> Jalview's built-in distance matrix calculations.
>
> Most bioinformatics analyses encode molecular sequences using a standard
> alphabet, and Jalview applies a filter before analysing the sequences to
> ensure the analysis doesn't fail because of an unknown symbol. Even
> worse, if you use the BLOSUM62 score model for tree building, then
> different symbol matches are scored in different ways because some
> mutations are more likely than others (kind of like someone starting
> modern dance rather than ballet, or some other pre-teen activity).
>
> It wouldn't be too hard to adapt Jalview's filters to support a wider
> range of states, or use a different similarity model, and in fact, it's
> something I'd like it to be able to do in the future. If you, or anyone
> you know, is a Java programmer, then I'd happily point out the parts that
> need to be modified for your needs (the colour schemes wouldn't work for
> your symbols either, if you have more than 20).
>
> Another alternative, which is a bit more technical, but ultimately more
> rigorous, would be to use one of the multiple alignment tools that have
> no built-in rules about sequence symbols. One or two of the programs
> Jalview uses can align generic sequences, and the Notredame group, which
> produced one of the most accurate alignment programs, developed a tool
> called SALTT (http://www.tcoffee.org/saltt/). This supports the analysis
> of generic symbol sequences, and might be just what you need.
>> Thanks in advance and my apologies for this naive question..
> You're welcome - and the question is not naive at all! The underlying
> algorithms and mathematical problems used in molecular sequence analysis
> are exactly what you might use for your life sequence analysis problem.
> The only differences are in how the symbols are interpreted, and what
> assumptions about the data have been made. The techniques most relevant
> to your analysis are hidden Markov models - which were originally
> developed for time series analysis, where some hidden process generates
> a sequence of observable states, and one wants to model the hidden
> process, in order - for example - to decide whether one hidden process
> is similar to the process that generated another series of states.
>
> I hope that helps - there are piles of literature on the subject, and
> I'm sure that others on the list might be able to suggest some fruitful
> directions.
> Jim.
>
> _______________________________________________
> Jalview-discuss mailing list
> [email protected]
> http://www.compbio.dundee.ac.uk/mailman/listinfo/jalview-discuss
>

--
Geoff Barton, Professor of Bioinformatics, College of Life Sciences
University of Dundee, Scotland, UK. [email protected]
Tel:+44 1382 385860/388731 (Fax:385764) www.compbio.dundee.ac.uk

The University of Dundee is registered Scottish charity: No.SC015096
