Hi Karen.
We don't allow attachments on the list, I'm afraid - but I did take a
look at what you sent.
On 03/06/2012 10:24, Karen Keight wrote:
Hi Jim and Geoff.. I've sent a message to the discussion list and it
is waiting for approval because of the attachments.
Below is the message I've sent .. just in case it does not arrive to
the discussion list..
I think that there's quite a lot of functionality in Jalview that I
could apply to my problem and I've started but I'm not sure if it is
ok so far and how to proceed from now on.
You seem to have done a fair amount of experimentation! I do have a
couple of comments:
I've been arranging my dataset for processing it in Jalview, I attach
it here (testdata.txt) in FASTA format (hopefully)
- Now I have 123 participants, each one is a sequence of 600 time points.
- At each time point a participant can be in one of 14 states, I used
14 Protein letters (A,R,N,D,C,E,Q,G,H,L,K,M,F,P) to represent my
states for Jalview to read it. Of course I selected that 14 letters
without following any biological criteria, just picked up a letter for
each state.
-For my dataset the P letter represents a NULL state meaning that
there is no time point evaluation for the subject , so it is used when
needed for every subject to complete the 600 time points.
P is, unfortunately, a fairly bad choice. It doesn't matter for percent
identity, but generally, P is a 'special' amino acid, and rarely
mutates. See below..
Is it in correct FASTA format?
If jalview read it, and the symbols appear in the alignment window in
the way you would expect, then it is correct!
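For reference, FASTA is simply a header line starting with '>' followed by one or more lines of sequence. Since you are working in Java, here is a minimal sketch of writing one record; the class name, method name and line width are made up for illustration and are not part of Jalview.

```java
// Illustrative helper, not Jalview code: formats one FASTA record.
class FastaWriter {
    /** Returns ">id" on the first line, then the sequence wrapped at lineWidth characters. */
    static String toFasta(String id, String sequence, int lineWidth) {
        if (lineWidth <= 0) {
            throw new IllegalArgumentException("lineWidth must be positive");
        }
        StringBuilder out = new StringBuilder();
        out.append('>').append(id).append('\n');
        for (int start = 0; start < sequence.length(); start += lineWidth) {
            int end = Math.min(start + lineWidth, sequence.length());
            out.append(sequence, start, end).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // One participant, wrapped at 60 symbols per line.
        System.out.print(toFasta("P001", "ARNDCEQGHLKMF", 60));
    }
}
```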
I've tried some trees and looks fine for average distance, neighbour
distance looks strange..
NJ can often look strange if the sequences are not closely related. In this
case, P112 has been chosen as the 'outgroup' - the one that appears
furthest away from all other sequences. The rest of the individuals seem
to fall into two more closely related groups.
Which option did you use to calculate each of these? Percentage
identity and BLOSUM62 will give different trees - you'll need to use the
latter if you want distances to account for similar states.
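To see why percentage identity cannot account for similar states, here is a rough sketch of the measure. It is a simplification for illustration only: the names are invented, and Jalview's own calculation may treat gaps and unequal lengths differently.

```java
// Illustrative only: a plain percent-identity measure, not Jalview's implementation.
class PercentIdentity {
    /** Percentage of positions at which two equal-length sequences carry the same symbol. */
    static double percentIdentity(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("sequences must have the same length");
        }
        if (a.isEmpty()) {
            return 0.0;
        }
        int same = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) == b.charAt(i)) {
                same++;
            }
        }
        return 100.0 * same / a.length();
    }

    public static void main(String[] args) {
        // A and N are both EDUCATION states, yet identity counts them as a plain mismatch.
        System.out.println(percentIdentity("AARD", "ANRD")); // 75.0
    }
}
```

Any two different letters count as equally different here, which is why a substitution matrix is needed if related states should appear closer together.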
I don't know if there's some way to take advantage of the following
information, to give some semantics to the clusters:
- A and N states represents situations in life that are related , for
example both of them refer to EDUCATION
- R, C, E, Q, L states represents also similar life situations, for
example all of them deal with FAMILY
- D, G, H, K, M represent EMPLOYMENT/JOBS
- P state represents NULL
-F represents a mixture of states that have no interpretation for the
moment.
You will need to recode your states to map similar states to similar
amino acids - then you'll be able to take advantage of the intrinsic
amino acid similarities that the conservation and blosum62 measures employ.
There is a Venn diagram here:
http://www.jalview.org/help/html/misc/aaproperties.html which indicates
the various properties shared by different amino acids. If you want to
have a 'NULL' state, then G is probably the best one to choose - but you
can also use '-' - the gap character. Gaps are treated specially, and
might actually be closest to what you intend the 'null's to
indicate.
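A recoding step along these lines could be sketched as below. The target letters are placeholders I have picked only to show the mechanics (roughly: two small polar residues for EDUCATION, five hydrophobic ones for FAMILY, five charged ones for EMPLOYMENT, the gap for NULL, and an arbitrary letter for the mixed state). Please check any real choice against the Venn diagram and the BLOSUM62 scores rather than trusting this table.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative recoder; the mapping below is a hypothetical example, not a recommendation.
class StateRecoder {
    static final Map<Character, Character> RECODE = new HashMap<>();
    static {
        // EDUCATION states (old A, N)
        RECODE.put('A', 'S');
        RECODE.put('N', 'T');
        // FAMILY states (old R, C, E, Q, L)
        RECODE.put('R', 'I');
        RECODE.put('C', 'L');
        RECODE.put('E', 'V');
        RECODE.put('Q', 'M');
        RECODE.put('L', 'F');
        // EMPLOYMENT states (old D, G, H, K, M)
        RECODE.put('D', 'D');
        RECODE.put('G', 'E');
        RECODE.put('H', 'K');
        RECODE.put('K', 'R');
        RECODE.put('M', 'H');
        // NULL state (old P) becomes the gap character
        RECODE.put('P', '-');
        // Mixed, uninterpreted state (old F): arbitrary placeholder
        RECODE.put('F', 'C');
    }

    /** Recodes one sequence, looking up each character once so that swaps cannot chain. */
    static String recode(String sequence) {
        StringBuilder out = new StringBuilder(sequence.length());
        for (int i = 0; i < sequence.length(); i++) {
            Character mapped = RECODE.get(sequence.charAt(i));
            if (mapped == null) {
                throw new IllegalArgumentException("unknown state: " + sequence.charAt(i));
            }
            out.append(mapped);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(recode("ANP")); // ST-
    }
}
```

The per-character lookup matters: a chain of search-and-replace passes would turn old F into C and then that C into L.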
I think that useful information that I should use comes from the
Conservation, Quality and Consensus graphs shown below the sequences
If I understand it well, the Consensus Graph shows for each of my 600
time points the most frequent states and their %
I'm having most Fs at the beginning and Ps thereafter.
Could the Quality measure be useful for my sequences?
If you re-encode your states according to the amino acid groupings, then
you'll certainly get some information from the quality and conservation
measures. Conservation measures the number of common properties for the
amino acids in a column, and quality measures the average score for the
mutations observed in a column - so unlikely transitions will result in
a lower quality score.
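For comparison, the consensus you describe is just a per-column frequency count. A minimal sketch follows; it is not Jalview's code, the names are invented, and it counts every character (including any gap) and assumes plain ASCII symbols and equal-length sequences.

```java
// Illustrative only: most frequent symbol in a column and its percentage.
class ColumnConsensus {
    /** Counts how often each ASCII character occurs in one alignment column. */
    static int[] countColumn(String[] alignment, int col) {
        int[] counts = new int[128];
        for (String seq : alignment) {
            counts[seq.charAt(col)]++;
        }
        return counts;
    }

    /** Most frequent symbol in the column; ties go to the lowest character code. */
    static char mostFrequent(String[] alignment, int col) {
        int[] counts = countColumn(alignment, col);
        int best = 0;
        for (int c = 1; c < counts.length; c++) {
            if (counts[c] > counts[best]) {
                best = c;
            }
        }
        return (char) best;
    }

    /** Percentage of sequences carrying the given symbol in the column. */
    static double percentOf(String[] alignment, int col, char symbol) {
        return 100.0 * countColumn(alignment, col)[symbol] / alignment.length;
    }

    public static void main(String[] args) {
        String[] alignment = { "FFP", "FPP", "APP" };
        System.out.println(mostFrequent(alignment, 2));   // P
        System.out.println(percentOf(alignment, 2, 'P')); // 100.0
    }
}
```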
The Jmol visualization looks really nice!!!!!!!!!!!!!!!!!! Could it
make sense to assign some structure to my sequences and try to
visualize them like that?
Almost certainly not, I'm afraid :)
Would it be useful for finding some kind of regularity or relation?
Excuse for the naive question but what do you use Jmol visualization for?
Jmol is a molecular structure viewer. The sequences Jalview normally
handles are 'shorthand' for biological molecules. Similar sequences -
particularly evolutionarily related ones - have a similar 3D molecular
structure, and can often perform the same kinds of chemical interactions
(because they have similar shapes).
I've been programming in Java and I think that it would be really nice
if I could apply/adapt some of the functionalities to my problem.
OK. You would certainly be able to create a special set of parameters
for your symbols. The matrix encoding the amino acid groupings is
hard-coded into Jalview's source - so it would be straightforward for
you to change the groupings to better reflect the way you are using the
symbols. Take a look at the various matrices in this file:
http://source.jalview.org/gitweb/?p=jalview.git;a=blob_plain;f=src/jalview/schemes/ResidueProperties.java;hb=refs/heads/master
The basic philosophy here is that Jalview maps the letter in each
sequence to an amino acid or nucleotide index, which is then used to
index into the various score matrices and property vectors (the indexes
are given by aaIndex and nucleotideIndex). There are two types of score
models used here, whilst the rest of the file contains hard-coded colour
maps for the built-in colour schemes, and lookup tables for converting
indexes into text for displaying the name of the amino acid or
nucleotide to the user.
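The index-then-lookup pattern can be sketched for your own symbols as follows. This is a toy with only four states, and none of it is copied from ResidueProperties.java: STATE_INDEX merely plays the role that aaIndex plays there, and the scores (4 for the same state, 2 for the same group, -2 otherwise) are invented.

```java
import java.util.Arrays;

// Toy illustration of "letter -> index -> matrix lookup"; all values are hypothetical.
class ScoreLookup {
    // Maps a character code to a row/column of SCORES, or -1 if the symbol is unknown.
    static final int[] STATE_INDEX = new int[128];
    static {
        Arrays.fill(STATE_INDEX, -1);
        STATE_INDEX['A'] = 0; // education
        STATE_INDEX['N'] = 1; // education
        STATE_INDEX['R'] = 2; // family
        STATE_INDEX['D'] = 3; // employment
    }

    // Symmetric substitution matrix over the four states above.
    static final int[][] SCORES = {
        //  A   N   R   D
        {   4,  2, -2, -2 },
        {   2,  4, -2, -2 },
        {  -2, -2,  4, -2 },
        {  -2, -2, -2,  4 },
    };

    /** Similarity score for a pair of state letters. */
    static int score(char x, char y) {
        return SCORES[index(x)][index(y)];
    }

    private static int index(char c) {
        int i = c < STATE_INDEX.length ? STATE_INDEX[c] : -1;
        if (i < 0) {
            throw new IllegalArgumentException("unknown state: " + c);
        }
        return i;
    }

    public static void main(String[] args) {
        System.out.println(score('A', 'N')); // 2
    }
}
```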
The two types of models are substitution matrices - which reflect the
similarity between two symbols - and property matrices, which allow
'conservation' to be calculated, in order to reflect the number of
properties that are different for the symbols in a particular column.
The conservation may be very useful to you - see the paper linked to in
this post for more explanation:
http://www.compbio.dundee.ac.uk/pipermail/jalview-discuss/2012-May/000811.html
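As a rough idea of what a property-based measure does, here is a simplified sketch. It assumes a property counts as conserved when every symbol in the column agrees on it (all have it, or all lack it); the three properties and the state table are hypothetical and use only a few of your letters, and the real calculation in Jalview is more involved.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical property-based conservation; not Jalview's algorithm.
class PropertyConservation {
    // Property order: education, family, employment.
    static final Map<Character, boolean[]> PROPS = new HashMap<>();
    static {
        register("AN", true, false, false);  // education states
        register("RCE", false, true, false); // some family states
        register("DGH", false, false, true); // some employment states
    }

    private static void register(String symbols, boolean... flags) {
        for (char c : symbols.toCharArray()) {
            PROPS.put(c, flags);
        }
    }

    /** Number of properties on which all symbols in the column agree. */
    static int conservation(String column) {
        if (column.isEmpty()) {
            return 0;
        }
        boolean[] first = lookup(column.charAt(0));
        int conserved = 0;
        for (int p = 0; p < first.length; p++) {
            boolean agrees = true;
            for (int i = 1; i < column.length(); i++) {
                if (lookup(column.charAt(i))[p] != first[p]) {
                    agrees = false;
                    break;
                }
            }
            if (agrees) {
                conserved++;
            }
        }
        return conserved;
    }

    private static boolean[] lookup(char c) {
        boolean[] flags = PROPS.get(c);
        if (flags == null) {
            throw new IllegalArgumentException("unknown state: " + c);
        }
        return flags;
    }

    public static void main(String[] args) {
        System.out.println(conservation("AN")); // 3
        System.out.println(conservation("AR")); // 1
    }
}
```

A column mixing states from the same group scores as fully conserved here, while one mixing all three groups scores zero.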
Jim.