Re: [Jalview-discuss] Align general sequences

Geoff Barton Wed, 30 May 2012 08:08:13 -0700

Aha!  I was just about to answer this as well.  To add to Jim's 
comments, as a starting point you could just try reading your 
sequences into Jalview and seeing what it makes of them.  You'll need 
to put your strings in a text file in one of the formats supported by 
Jalview - I suggest FASTA as it is pretty simple.  Although Jalview 
is meant for sequences with an alphabet of at most 20 letters, it will 
not complain about the non-standard letters, and if you use the 
standard colouring schemes and PID (percentage identity) tree 
calculation you may get something that helps you interpret your 
strings.  I just tried this with some dummy data and the tree made 
sense even though it was ignoring some of the characters.
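
A minimal sketch of producing such a FASTA file, in Python.  The 
function name, the person labels and the example strings are all made 
up for illustration; the only thing taken from the FASTA format itself 
is that each record is a ">" header line followed by the sequence.

```python
def to_fasta(records, width=60):
    """Format a {name: sequence} mapping as FASTA text.

    Each record is a '>' header line followed by the sequence,
    wrapped at `width` characters per line.
    """
    lines = []
    for name, seq in records.items():
        lines.append(f">{name}")
        for start in range(0, len(seq), width):
            lines.append(seq[start:start + width])
    return "\n".join(lines) + "\n"


# Hypothetical life trajectories, one symbol per life state.
lives = {"person1": "ABCDE", "person2": "ABDEF"}
fasta_text = to_fasta(lives)
print(fasta_text, end="")

# To load the result in Jalview, save the text to a file, e.g.:
#     with open("lives.fasta", "w") as handle:
#         handle.write(fasta_text)
```

Avoid using ">" as one of your state symbols, since it marks the start 
of a header line.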


This is all predicated on two assumptions:

1. Your strings are all the same length
2. You do not need to align them optimally - i.e. position 1 always 
aligns with 1, 2 with 2 and so on.
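
Under those two assumptions, a percentage identity between two strings 
is just the fraction of positions that carry the same symbol.  The 
sketch below (Python, hypothetical names and data) shows that plain 
position-by-position comparison; it is an illustration of the idea and 
not necessarily the exact calculation Jalview performs.

```python
def percent_identity(a, b):
    """Percentage of positions at which two equal-length strings agree."""
    if len(a) != len(b):
        # Assumption 1: all strings must have the same length.
        raise ValueError("strings must be the same length")
    if not a:
        return 0.0
    # Assumption 2: position i in one string is compared with position i
    # in the other, with no insertions or deletions.
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# Hypothetical trajectories of five states each.
seqs = ["ABCDE", "ABDEF", "ABCDF"]
pid = [[percent_identity(s, t) for t in seqs] for s in seqs]
for row in pid:
    print(row)
```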

If you need to align them, then see Jim's comments on tools that can 
cope with non-protein/DNA/RNA sequences.  Likewise, if you want to get 
serious about the interpretation of the trees you will need to recode to 
make sure your scoring metric for comparing strings is sensible given 
the data.
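
One way to picture a scoring metric tailored to the data is a table of 
mismatch costs, so that swapping two closely related states costs less 
than swapping two unrelated ones.  The following Python sketch assumes 
such a table; the symbol "K" for modern dance and the cost of 0.25 are 
invented purely for illustration.

```python
def weighted_distance(a, b, cost, default=1.0):
    """Sum of per-position mismatch costs for two equal-length strings.

    `cost` maps an unordered pair of symbols (a frozenset) to the
    penalty for that mismatch; pairs not listed cost `default`.
    """
    if len(a) != len(b):
        raise ValueError("strings must be the same length")
    total = 0.0
    for x, y in zip(a, b):
        if x != y:
            total += cost.get(frozenset((x, y)), default)
    return total


# Hypothetical table: C (start ballet) and K (start modern dance)
# are treated as near-equivalent states.
cost = {frozenset("CK"): 0.25}

print(weighted_distance("ABCDE", "ABKDE", cost))  # ballet vs modern dance
print(weighted_distance("ABCDE", "ABGDE", cost))  # ballet vs second marriage
```

Choosing the costs is the hard part and depends entirely on what the 
states mean in your data.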

If you don't need to do alignment (i.e. put in insertions/deletions) 
then there are 101 ways of comparing the strings, scoring the 
comparison and plotting the results as trees, PCA plots or networks.  
The statistics package "R" would be a useful tool for this.
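
As one example of those 101 ways, the sketch below counts mismatches 
between every pair of strings and then builds a tree by repeatedly 
merging the two closest groups (average linkage).  It is written in 
Python rather than R, uses only the standard library, and all names 
and data are hypothetical; it is a toy illustration, not Jalview's 
tree code.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must be the same length")
    return sum(x != y for x, y in zip(a, b))


def average_linkage(names, seqs):
    """Build a tree (nested tuples) by average-linkage clustering."""
    dist = [[hamming(s, t) for t in seqs] for s in seqs]
    # Each cluster is (subtree, list of member indices into seqs).
    clusters = [(name, [i]) for i, name in enumerate(names)]

    def cluster_distance(c1, c2):
        # Mean distance over all pairs with one member in each cluster.
        pairs = [(i, j) for i in c1[1] for j in c2[1]]
        return sum(dist[i][j] for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        # Find the closest pair of clusters and merge them.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Hypothetical data: two pairs of similar trajectories.
names = ["p1", "p2", "p3", "p4"]
seqs = ["ABCDE", "ABCDF", "AGHIJ", "AGHIK"]
tree = average_linkage(names, seqs)
print(tree)
```

This naive version recomputes cluster distances on every pass, so for 
something like 3000 strings a proper statistics package will be a 
better choice.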

As an aside, about 15 years ago someone asked me a similar question - 
they were working on the pre-Incan Quipu knot system and wanted to 
compare Quipu to each other by multiple alignment.  In the end I think 
they used an adapted version of Clustal for this since in their case it 
was clear the Quipu would require insertions/deletions to find common 
regions.

Have fun!

Geoff.


On 30/05/2012 15:43, Jim Procter wrote:
> Hello there Karen.
>
> On 28/05/2012 20:14, Karen Keight wrote:
>> Hi! I've just discovered Jalview, I have a social sciences background,
>> I'm studying life trajectories as sequences of states.
> OK. Sounds interesting !
>> I'm wondering if I could apply Jalview for aligning sequences of  a
>> fixed number of  symbols (for example 50) each
>> one representing a life state of one person (A= Birth, B= Start
>> elementary school, C= Start Ballet, D= Get married, E= First child,
>> F= Get divorced, G= Second marriage, ...,&= Quit first job, *= Night
>> courses,.. etc).
>> So each sequence is one person's life and I have for example 3000 persons.
>>
>> Could I apply the tree-based grouping of sequences functionality (e.g.
>> Neighbour joining or Average distance)  to get similar life
>> trajectories using Jalview?
>> I hope so!!!
> You certainly could in principle.. neighbour joining and average
> distance are both generic tree algorithms that work on a matrix of
> distances, and you could even use Jalview's PCA function, which is
> another kind of cluster analysis, but you would first need to align your
> sequences (if they are not simply based on time points), and you'd also
> need to be careful about how you encode life states if you used
> Jalview's built in distance matrix calculations.
>
> Most bioinformatics analyses encode molecular sequences using a standard
> alphabet, and Jalview applies a filter before analysing the sequences to
> ensure the analysis doesn't fail because of an unknown symbol. Even
> worse, if you use the BLOSUM62 score model for tree building, then
> different symbol matches are scored in different ways because some
> mutations are more likely than others (kind of like someone starting
> modern dance rather than ballet, or some other pre-teen activity).
>
> It wouldn't be too hard to adapt Jalview's filters to support a wider
> range of states, or use a different similarity model, and in fact, it's
> something I'd like it to be able to do in the future. If you, or anyone
> you know, is a Java programmer, then I'd happily point out the parts that
> need to be modified for your needs (the colour schemes wouldn't work for
> your symbols either, if you have more than 20).
>
> Another alternative, which is a bit more technical, but ultimately more
> rigorous, would be to use one of the multiple alignment tools that have
> no built-in rules about sequence symbols. One or two of the programs
> Jalview uses can align generic sequences, and the Notredame group, which
> produced one of the most accurate alignment programs, developed a tool
> called SALTT (http://www.tcoffee.org/saltt/). This supports the analysis
> of generic symbol sequences, and might be just what you need.
>> Thanks in advance and my apologies for this naive question..
> You're welcome - and the question is not naive at all!  The underlying
> algorithms and mathematical problems used in molecular sequence analysis
> are exactly what you might use for your life sequence analysis problem..
> the only differences are in how the symbols are interpreted, and what
> assumptions about the data have been made. The techniques most relevant
> to your analysis are hidden Markov models - which were originally
> developed for time series analysis, where some hidden process generates
> a sequence of observable states, and one wants to model the hidden
> process, in order - for example - to decide whether one hidden process
> is similar to the process that generated another series of states.
>
> I hope that helps - there are piles of literature on the subject, and
> I'm sure that others on the list might be able to suggest some fruitful
> directions.
> Jim.
>
> _______________________________________________
> Jalview-discuss mailing list
> [email protected]
> http://www.compbio.dundee.ac.uk/mailman/listinfo/jalview-discuss
>

-- 
Geoff Barton, Professor of Bioinformatics,  College of Life Sciences
University of Dundee, Scotland, UK.          [email protected]
Tel:+44 1382 385860/388731 (Fax:385764)     www.compbio.dundee.ac.uk

The University of Dundee is registered Scottish charity: No.SC015096
