Re: [Jalview-discuss] Rv: Align general sequences

Geoff Barton Mon, 04 Jun 2012 07:10:59 -0700

Hi Karen,

I know Jim has replied in some detail about how you might use Jalview to help in your analysis, so I won't add very much here. While Jalview could perhaps be adapted to help with your research, I think you really should talk to, or collaborate with, someone who has experience of multivariate data analysis, in order to explain the question you are trying to address and thus devise the most appropriate analysis solution. Since your strings are all the same length and do not need alignment, I suggested in my last email that you look at R and the various multivariate techniques it implements. You also need to decide how best to represent the transitions between the states in your system. There are many possible ways to do this, but while my group (including Jim) has expertise in this sort of thing, your problem area is a long way from what we do and it would be hard for us to justify a collaboration.

So, please use Jalview if you like to visualise your data, but do find someone in your institution who can sit down with you and discuss the best way to go about answering the questions you are interested in, particularly with respect to clustering.

I wish you all the best in your research - please let us know when you find a good solution and particularly where we should look to see the final publication of your work!


With every good wish,

Geoff.

On 03/06/2012 19:14, Jim Procter wrote:

Hi Karen.
We don't allow attachments on the list, I'm afraid - but I did take a look at what you sent.
On 03/06/2012 10:24, Karen Keight wrote:
Hi Jim and Geoff. I've sent a message to the discussion list and it is waiting for approval because of the attachments. Below is the message I sent, just in case it does not arrive at the discussion list. I think that there's quite a lot of functionality in Jalview that I could apply to my problem, and I've made a start, but I'm not sure if it is OK so far and how to proceed from now on.
You seem to have done a fair amount of experimentation! I do have a couple of comments:
I've been arranging my dataset for processing in Jalview. I attach it here (testdata.txt) in FASTA format (hopefully).
- I now have 123 participants; each one is a sequence of 600 time points.
- At each time point a participant can be in one of 14 states. I used 14 protein letters (A,R,N,D,C,E,Q,G,H,L,K,M,F,P) to represent my states so that Jalview can read them. Of course I selected those 14 letters without following any biological criteria; I just picked a letter for each state.
- In my dataset the letter P represents a NULL state, meaning that there is no time point evaluation for the subject, so it is used where needed to pad every subject to 600 time points.
P is, unfortunately, a fairly bad choice. It doesn't matter for percent identity, but generally, P is a 'special' amino acid, and rarely mutates. See below.
Is it in correct FASTA format?
If Jalview read it, and the symbols appear in the alignment window in the way you would expect, then it is correct!
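As a point of reference for the format question above, here is a minimal Python sketch of writing such a dataset as FASTA. The function name `to_fasta`, the participant identifier `P001`, and the 60-character line width are illustrative assumptions, not anything taken from testdata.txt; the padding symbol defaults to P only because that is the NULL letter described above.

```python
def to_fasta(records, length=600, null_symbol="P", width=60):
    """Return FASTA text for a dict of {name: state string}.

    Sequences shorter than `length` are padded on the right with
    `null_symbol`, so every record ends up the same length.
    """
    lines = []
    for name, states in records.items():
        if len(states) > length:
            raise ValueError(f"{name} has more than {length} time points")
        padded = states.ljust(length, null_symbol)
        lines.append(f">{name}")  # FASTA header line
        # Wrap the sequence so no line is longer than `width` characters.
        lines.extend(padded[i:i + width] for i in range(0, length, width))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical participant with only three observed time points.
    print(to_fasta({"P001": "AAN"}, length=10, width=5), end="")
```

Each record is one `>name` header followed by the sequence; as noted above, the practical test is simply whether Jalview loads the file and shows the symbols you expect.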
I've tried some trees and they look fine for average distance; neighbour distance looks strange.
NJ can often look strange if the sequences are not well-related. In this case, P112 has been chosen as the 'outgroup' - the one that appears furthest away from all other sequences; the rest of the individuals seem to fall into two more closely related groups.
What option did you use to calculate each of these? Percentage identity and BLOSUM62 will give different trees - you'll need to use the latter if you want distances to account for similar states.
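To make the percentage identity option concrete, here is a rough Python sketch of an identity-based distance between equal-length state strings. It is a simplification and an assumption on my part: Jalview's own calculation has its own rules (for example for gaps), and the function names below are hypothetical.

```python
def percent_identity(a, b):
    """Percentage of positions at which two equal-length strings agree."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    if not a:
        raise ValueError("sequences must not be empty")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def distance_matrix(seqs):
    """Map each (name, name) pair to 100 minus their percent identity."""
    return {
        (p, q): 100.0 - percent_identity(seqs[p], seqs[q])
        for p in seqs
        for q in seqs
    }


if __name__ == "__main__":
    toy = {"P001": "AANP", "P002": "AARP"}  # hypothetical participants
    print(distance_matrix(toy)[("P001", "P002")])
```

A measure like this treats every mismatch as equally bad, which is why a similarity matrix is needed if related states should count as closer than unrelated ones.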
I don't know if there's some way to take advantage of the following information, to give some semantics to the clusters:
- A and N states represent situations in life that are related; for example, both of them refer to EDUCATION
- R, C, E, Q, L states also represent similar life situations; for example, all of them deal with FAMILY
- D, G, H, K, M represent EMPLOYMENT/JOBS
- P state represents NULL
- F represents a mixture of states that have no interpretation for the moment.
You will need to recode your states to map similar states to similar amino acids - then you'll be able to take advantage of the intrinsic amino acid similarities that the conservation and BLOSUM62 measures employ.
There is a Venn diagram here: http://www.jalview.org/help/html/misc/aaproperties.html which indicates the various properties shared by different amino acids. If you want to have a 'NULL' state, then G is probably the best one to choose - but you can also use '-' - the gap character. Gaps are treated specially, and might actually be closest to what you would consider the 'null's to indicate.
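The recoding suggested above could be sketched in Python as follows. The specific target letters are purely my illustrative assumption (hydroxyl-containing S/T for EDUCATION, hydrophobic I/L/V/M/A for FAMILY, charged K/R/H/D/E for EMPLOYMENT, the gap character for NULL, and X for the uninterpreted mixture); they should be checked against the Venn diagram and the scoring matrix before being relied on.

```python
# Old letter -> new letter. The groupings come from the message above;
# the choice of target amino acids is a hypothetical example only.
RECODE = {
    # EDUCATION
    "A": "S", "N": "T",
    # FAMILY
    "R": "I", "C": "L", "E": "V", "Q": "M", "L": "A",
    # EMPLOYMENT/JOBS
    "D": "K", "G": "R", "H": "H", "K": "D", "M": "E",
    # NULL -> gap character
    "P": "-",
    # Mixture with no interpretation -> X (assumed 'unknown' symbol)
    "F": "X",
}
TABLE = str.maketrans(RECODE)


def recode(sequence):
    """Translate one state string from the old letters to the new ones."""
    unknown = set(sequence) - set(RECODE)
    if unknown:
        raise ValueError(f"unexpected symbols: {sorted(unknown)}")
    # translate() maps every character in a single pass, so a chain
    # such as L -> A followed by A -> S cannot happen.
    return sequence.translate(TABLE)


if __name__ == "__main__":
    print(recode("AANRCPPP"))
```

Doing the translation in one pass matters here because the old and new alphabets overlap.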
I think that useful information that I should use comes from the Conservation, Quality and Consensus graphs shown below the sequences. If I understand it well, the Consensus graph shows, for each of my 600 time points, the most frequent states and their %
I'm getting mostly Fs at the beginning and Ps thereafter.
Could the Quality measure be useful for my sequences?
If you re-encode your states according to the amino acid groupings, then you'll certainly get some information from the quality and conservation measures. Conservation measures the number of common properties for the amino acids in a column, and quality measures the average score for the mutations observed in a column - so unlikely transitions will result in a lower quality score.
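For the consensus reading described in the question above (the most frequent state at each time point and its percentage), a minimal Python sketch is given here. It is an approximation under my own assumptions rather than Jalview's code: it counts every symbol, including any null or gap symbol, and the name `column_consensus` is hypothetical.

```python
from collections import Counter


def column_consensus(seqs):
    """For each column, return (most frequent symbol, percentage).

    `seqs` is a list of equal-length strings, one per participant.
    If two symbols tie, the one seen first in the column is reported.
    """
    result = []
    for column in zip(*seqs):
        symbol, count = Counter(column).most_common(1)[0]
        result.append((symbol, 100.0 * count / len(column)))
    return result


if __name__ == "__main__":
    toy = ["FFP", "FPP", "FPA"]  # three hypothetical participants
    for position, (symbol, pct) in enumerate(column_consensus(toy), 1):
        print(position, symbol, round(pct, 1))
```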
The Jmol visualization looks really nice!!!!!!!!!!!!!!!!!! Could it make sense to assign some structure to my sequences and try to visualize them like that?
Almost certainly not, I'm afraid :)
Would it be useful for finding some kind of regularity or relation?
Excuse the naive question, but what do you use Jmol visualization for?
Jmol is a molecular structure viewer. The sequences Jalview normally handles are 'shorthand' for biological molecules. Similar sequences - particularly evolutionarily related ones - have a similar 3D molecular structure, and can often perform the same kinds of chemical interactions (because they have similar shapes).
I've been programming in Java and I think that it would be really nice if I could apply/adapt some of the functionalities to my problem.
OK. You would certainly be able to create a special set of parameters for your symbols. The matrix encoding the amino acid groupings is hard-coded into Jalview's source - so it would be straightforward for you to change the groupings to better reflect the way you are using the symbols. Take a look at the various matrices in this file:
http://source.jalview.org/gitweb/?p=jalview.git;a=blob_plain;f=src/jalview/schemes/ResidueProperties.java;hb=refs/heads/master
The basic philosophy here is that Jalview maps the letter in each sequence to an amino acid or nucleotide index, which is used to map into the various score matrices and property vectors (the indexes are given by aaIndex and nucleotideIndex). There are two types of score models used here, whilst the rest of the file contains hard-coded colour maps for the built-in colour schemes, and look-up tables for converting indexes into text for displaying the name of the amino acid or nucleotide to the user.
The two types of models are substitution matrices - which reflect the similarity between two symbols - and property matrices, which allow 'conservation' to be calculated, in order to reflect the number of properties that are different for the symbols in a particular column.
The conservation may be very useful to you - see the paper linked to in this post for more explanation: http://www.compbio.dundee.ac.uk/pipermail/jalview-discuss/2012-May/000811.html
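The index-then-look-up idea described above can be illustrated with a small Python sketch for the 14 original state letters. This is not Jalview's code and does not reproduce ResidueProperties.java: the scores (4, 1, -2), the single "group" property per symbol, and all names below are hypothetical choices made only to show the two kinds of model.

```python
# State groupings as given earlier in this thread.
GROUPS = {
    "education": "AN",
    "family": "RCEQL",
    "employment": "DGHKM",
    "null": "P",
    "mixed": "F",
}
SYMBOLS = "".join(GROUPS.values())
# Symbol -> row/column index, playing the role an index table plays.
INDEX = {symbol: i for i, symbol in enumerate(SYMBOLS)}
# Property look-up: here each symbol has exactly one property, its group.
GROUP_OF = {s: g for g, members in GROUPS.items() for s in members}


def build_matrix(same=4, related=1, unrelated=-2):
    """Build a symmetric substitution matrix from the groupings."""
    n = len(SYMBOLS)
    matrix = [[unrelated] * n for _ in range(n)]
    for a in SYMBOLS:
        for b in SYMBOLS:
            if a == b:
                matrix[INDEX[a]][INDEX[b]] = same
            elif GROUP_OF[a] == GROUP_OF[b]:
                matrix[INDEX[a]][INDEX[b]] = related
    return matrix


MATRIX = build_matrix()


def score(a, b):
    """Substitution model: similarity between two symbols."""
    return MATRIX[INDEX[a]][INDEX[b]]


def conserved_group(column):
    """Property model: the group shared by a whole column, else None."""
    groups = {GROUP_OF[s] for s in column}
    return groups.pop() if len(groups) == 1 else None


if __name__ == "__main__":
    print(score("A", "N"), score("A", "R"), conserved_group("RCE"))
```

With several properties per symbol instead of one, the same look-up pattern would let a column be scored by how many properties all of its symbols share.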
Jim.



_______________________________________________
Jalview-discuss mailing list
[email protected]
http://www.compbio.dundee.ac.uk/mailman/listinfo/jalview-discuss


--
Geoff Barton, Professor of Bioinformatics,  College of Life Sciences
University of Dundee, Scotland, UK.          [email protected]
Tel:+44 1382 385860/388731 (Fax:385764)     www.compbio.dundee.ac.uk

The University of Dundee is registered Scottish charity: No.SC015096
