Re: [Jalview-discuss] Rv: Align general sequences

Jim Procter Sun, 03 Jun 2012 11:14:49 -0700

Hi Karen.

We don't allow attachments on the list, I'm afraid - but I did take a look at what you sent.


On 03/06/2012 10:24, Karen Keight wrote:

Hi Jim and Geoff. I've sent a message to the discussion list and it is waiting for approval because of the attachments. Below is the message I sent, just in case it does not reach the discussion list.

I think that there's quite a lot of functionality in Jalview that I could apply to my problem. I've made a start, but I'm not sure if it is OK so far or how to proceed from here.

You seem to have done a fair amount of experimentation! I do have a couple of comments:

I've been arranging my dataset for processing in Jalview. I attach it here (testdata.txt), in FASTA format (hopefully).
- I now have 123 participants; each one is a sequence of 600 time points.
- At each time point a participant can be in one of 14 states. I used 14 protein letters (A,R,N,D,C,E,Q,G,H,L,K,M,F,P) to represent my states so that Jalview can read them. Of course, I selected those 14 letters without following any biological criteria; I just picked a letter for each state.
- For my dataset, the letter P represents a NULL state, meaning that there is no time point evaluation for the subject, so it is used wherever needed to pad each subject out to the 600 time points.

P is, unfortunately, a fairly bad choice. It doesn't matter for percent identity, but generally, P is a 'special' amino acid, and rarely mutates. See below.

Is it in correct FASTA format?

If Jalview read it, and the symbols appear in the alignment window in the way you would expect, then it is correct!

I've tried some trees. They look fine for average distance, but neighbour distance looks strange.

NJ (neighbour joining) can often look strange if the sequences are not well-related. In this case, P112 has been chosen as the 'outgroup' - the one that appears furthest away from all other sequences - and the rest of the individuals seem to fall into two more closely related groups.

What option did you use to calculate each of these? Percentage identity and BLOSUM62 will give different trees - you'll need to use the latter if you want distances to account for similar states.
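To illustrate why percentage identity cannot account for similar states, here is a minimal Java sketch of a percent identity calculation between two aligned sequences. It is a simplification and not Jalview's own implementation; in particular, skipping columns where either sequence has a gap is an assumption made for this example, and the class and method names are hypothetical.

```java
public class PercentIdentity {
    /** Percentage of compared columns in which two aligned sequences carry the same symbol. */
    public static double percentIdentity(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("sequences must be aligned to the same length");
        }
        int compared = 0;
        int identical = 0;
        for (int i = 0; i < a.length(); i++) {
            char x = a.charAt(i);
            char y = b.charAt(i);
            if (x == '-' || y == '-') {
                continue; // assumption: columns with a gap in either sequence are not compared
            }
            compared++;
            if (x == y) {
                identical++;
            }
        }
        return compared == 0 ? 0.0 : 100.0 * identical / compared;
    }

    public static void main(String[] args) {
        // A and N are both EDUCATION states, yet they count as a plain mismatch here.
        System.out.println(percentIdentity("AAND", "ANND"));
    }
}
```

Every mismatch costs the same under this measure, whether or not the two states are related, which is why a substitution matrix is needed for distances that reflect state similarity.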

I don't know if there's some way to take advantage of the following information, to give some semantics to the clusters:
- A and N states represent situations in life that are related; for example, both of them refer to EDUCATION
- R, C, E, Q, L states also represent similar life situations; for example, all of them deal with FAMILY
- D, G, H, K, M represent EMPLOYMENT/JOBS
- P state represents NULL
- F represents a mixture of states that have no interpretation for the moment.

You will need to recode your states to map similar states to similar amino acids - then you'll be able to take advantage of the intrinsic amino acid similarities that the conservation and BLOSUM62 measures employ.

There is a Venn diagram here: http://www.jalview.org/help/html/misc/aaproperties.html which indicates the various properties shared by different amino acids. If you want to have a 'NULL' state, then G is probably the best one to choose - but you can also use '-' - the gap character. Gaps are treated specially, and might actually be closest to what you would consider the 'null's to indicate.
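The recoding described above could be sketched in Java roughly as follows. The particular target letters are an illustrative assumption only (D/E for EDUCATION because both are acidic, hydrophobic residues for FAMILY, polar or charged residues for EMPLOYMENT, W picked arbitrarily for the mixture state, and the gap character for NULL); any real mapping should be checked against the Venn diagram linked above. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class StateRecoder {
    // Hypothetical mapping from the original state letters to new symbols.
    private static final Map<Character, Character> RECODE = new HashMap<>();
    static {
        // EDUCATION (A, N) -> the two acidic residues
        RECODE.put('A', 'D'); RECODE.put('N', 'E');
        // FAMILY (R, C, E, Q, L) -> hydrophobic residues
        RECODE.put('R', 'I'); RECODE.put('C', 'L'); RECODE.put('E', 'V');
        RECODE.put('Q', 'M'); RECODE.put('L', 'A');
        // EMPLOYMENT (D, G, H, K, M) -> polar or charged residues
        RECODE.put('D', 'K'); RECODE.put('G', 'R'); RECODE.put('H', 'H');
        RECODE.put('K', 'Q'); RECODE.put('M', 'N');
        // Mixture state F -> W (arbitrary choice for this sketch)
        RECODE.put('F', 'W');
        // NULL state P -> gap character
        RECODE.put('P', '-');
    }

    /** Recodes one sequence symbol by symbol, so overlapping old and new letters cannot clash. */
    public static String recode(String sequence) {
        StringBuilder out = new StringBuilder(sequence.length());
        for (char c : sequence.toCharArray()) {
            Character mapped = RECODE.get(Character.toUpperCase(c));
            if (mapped == null) {
                throw new IllegalArgumentException("unknown state symbol: " + c);
            }
            out.append(mapped);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(recode("FFANRDPP"));
    }
}
```

Because several letters appear both as old and as new symbols, a per-character lookup is safer than a chain of search-and-replace operations, which would recode some positions twice.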

I think that useful information that I should use comes from the Conservation, Quality and Consensus graphs shown below the sequences. If I understand it well, the Consensus graph shows, for each of my 600 time points, the most frequent states and their %.
I'm having mostly Fs at the beginning and Ps thereafter.
Could the Quality measure be useful for my sequences?

If you re-encode your states according to the amino acid groupings, then you'll certainly get some information from the quality and conservation measures. Conservation measures the number of common properties for the amino acids in a column, and quality measures the average score for the mutations observed in a column - so unlikely transitions will result in a lower quality score.
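For comparison with those two measures, the consensus described in the question (the most frequent symbol at each time point and its percentage) can be sketched in a few lines of Java. This is a simplified illustration and not Jalview's code; it counts gap characters like any other symbol and breaks ties in favour of the alphabetically first symbol, both of which are assumptions for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ColumnConsensus {
    /** Returns the most frequent symbol in one column and its rounded percentage, e.g. "A 50%". */
    public static String consensusAt(List<String> alignment, int column) {
        // TreeMap keeps symbols sorted, so ties resolve to the alphabetically first symbol.
        Map<Character, Integer> counts = new TreeMap<>();
        for (String seq : alignment) {
            counts.merge(seq.charAt(column), 1, Integer::sum);
        }
        char best = '?';
        int bestCount = 0;
        for (Map.Entry<Character, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        long percent = Math.round(100.0 * bestCount / alignment.size());
        return best + " " + percent + "%";
    }

    public static void main(String[] args) {
        // Four toy participants, three time points each.
        List<String> alignment = Arrays.asList("FFP", "FAP", "FAP", "FNP");
        for (int col = 0; col < 3; col++) {
            System.out.println("column " + col + ": " + consensusAt(alignment, col));
        }
    }
}
```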

The Jmol visualization looks really nice!!!!!!!!!!!!!!!!!! Could it make sense to assign some structure to my sequences and try to visualize them like that?

Almost certainly not, I'm afraid :)

Would it be useful for finding some kind of regularity or relation?
Excuse the naive question, but what do you use the Jmol visualization for?

Jmol is a molecular structure viewer. The sequences Jalview normally handles are 'shorthand' for biological molecules. Similar sequences - particularly evolutionarily related ones - have a similar 3D molecular structure, and can often perform the same kinds of chemical interactions (because they have similar shapes).

I've been programming in Java and I think that it would be really nice if I could apply/adapt some of the functionalities to my problem.

OK. You would certainly be able to create a special set of parameters for your symbols. The matrix encoding the amino acid groupings is hard-coded into Jalview's source - so it would be straightforward for you to change the groupings to better reflect the way you are using the symbols. Take a look at the various matrices in this file:


http://source.jalview.org/gitweb/?p=jalview.git;a=blob_plain;f=src/jalview/schemes/ResidueProperties.java;hb=refs/heads/master

The basic philosophy here is that Jalview maps the letter in each sequence to an amino acid or nucleotide index, which is used to look up entries in the various score matrices and property vectors (the indexes are given by aaIndex and nucleotideIndex). There are two types of score models used here, whilst the rest of the file contains hard-coded colour maps for the built-in colour schemes, and lookup tables for converting indexes into text for displaying the name of the amino acid or nucleotide to the user.

The two types of models are substitution matrices - which reflect the similarity between two symbols - and property matrices, which allow 'conservation' to be calculated, in order to reflect the number of properties that are different for the symbols in a particular column.

The conservation may be very useful to you - see the paper linked to in this post for more explanation: http://www.compbio.dundee.ac.uk/pipermail/jalview-discuss/2012-May/000811.html
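As a rough illustration of that design applied to the 14 life states, the Java sketch below builds a symbol index, a substitution matrix and a simple property-based conservation count. It does not reproduce the contents or layout of ResidueProperties.java; every name in it is hypothetical, the scores (4 for identical states, 1 for states in the same group, -2 otherwise) are arbitrary placeholders, and the conservation count is a simplification in which a property is conserved when all symbols in a column either have it or lack it.

```java
import java.util.Arrays;

public class StateScoreModel {
    // Hypothetical state symbols in index order, using the letters from the original dataset.
    static final String SYMBOLS = "ANRCEQLDGHKMFP";
    // Group of each symbol: 0 = EDUCATION, 1 = FAMILY, 2 = EMPLOYMENT, 3 = mixture, 4 = NULL.
    static final int[] GROUP = {0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 4};
    static final int PROPERTY_COUNT = 3; // EDUCATION, FAMILY, EMPLOYMENT

    static final int[] INDEX = new int[128]; // character code -> symbol index, -1 if unknown
    static final int[][] SCORE = new int[SYMBOLS.length()][SYMBOLS.length()];

    static {
        Arrays.fill(INDEX, -1);
        for (int i = 0; i < SYMBOLS.length(); i++) {
            INDEX[SYMBOLS.charAt(i)] = i;
        }
        // Placeholder scores: identical > same group > different group.
        for (int i = 0; i < SYMBOLS.length(); i++) {
            for (int j = 0; j < SYMBOLS.length(); j++) {
                SCORE[i][j] = (i == j) ? 4 : (GROUP[i] == GROUP[j] ? 1 : -2);
            }
        }
    }

    static int index(char c) {
        int i = (c < 128) ? INDEX[c] : -1;
        if (i < 0) {
            throw new IllegalArgumentException("unknown state symbol: " + c);
        }
        return i;
    }

    /** Substitution score for a pair of state symbols. */
    public static int score(char a, char b) {
        return SCORE[index(a)][index(b)];
    }

    /** Counts the properties on which every symbol in the column agrees. */
    public static int conservedProperties(String column) {
        if (column.isEmpty()) {
            return 0;
        }
        int conserved = 0;
        for (int p = 0; p < PROPERTY_COUNT; p++) {
            boolean first = GROUP[index(column.charAt(0))] == p;
            boolean allAgree = true;
            for (char c : column.toCharArray()) {
                if ((GROUP[index(c)] == p) != first) {
                    allAgree = false;
                    break;
                }
            }
            if (allAgree) {
                conserved++;
            }
        }
        return conserved;
    }

    public static void main(String[] args) {
        System.out.println(score('A', 'N'));             // two EDUCATION states
        System.out.println(conservedProperties("AANN")); // a column of EDUCATION states only
    }
}
```

Keeping the index, the matrix and the group table as separate arrays means the groupings can be changed in one place without touching the lookup code.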


Jim.

