Hi Karen,
I know Jim has replied in some detail about how you might use Jalview to
help in your analysis, so I won't add very much here. While Jalview
could perhaps be adapted to help with your research, I think you really
should talk to, or collaborate with, someone who has experience of
multivariate data analysis, so that you can explain the question you are
trying to address and devise the most appropriate analysis
solution. Since your strings are all the same length and do not need
alignment, I suggested in my last email that you look at R and the
various multivariate techniques it implements. You also need to decide
how best to represent the transitions between the states in your
system. There are many possible ways to do this, but while my group
(including Jim) has expertise in this sort of thing, your problem area
is a long way from what we do and it would be hard for us to justify a
collaboration.
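As one illustration of what "representing the transitions" could mean, here is a minimal Python sketch that counts state changes between consecutive time points for a single participant. It is only one of the many possible representations, the function name is hypothetical, and skipping the NULL state (P in the current encoding) is an assumption rather than a recommendation.

```python
from collections import Counter


def transition_counts(seq, null="P"):
    """Count changes of state between consecutive time points.

    Pairs involving the NULL symbol are skipped (an assumption), and
    staying in the same state is not counted as a transition.
    """
    counts = Counter()
    for prev, curr in zip(seq, seq[1:]):
        if prev == null or curr == null:
            continue
        if prev != curr:
            counts[(prev, curr)] += 1
    return counts


if __name__ == "__main__":
    # Toy participant: F, F, A, A, N, then no evaluation.
    print(transition_counts("FFAANP"))
```

The resulting counts per participant could then be arranged as one row of a table and passed to whichever multivariate or clustering method is chosen.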
So, please use Jalview if you like to visualise your data, but do find
someone in your institution who can sit down with you and discuss the
best way to go about answering the questions you are interested in,
particularly with respect to clustering.
I wish you all the best in your research - please let us know when you
find a good solution and particularly where we should look to see the
final publication of your work!
With every good wish,
Geoff.
On 03/06/2012 19:14, Jim Procter wrote:
Hi Karen.
We don't allow attachments on the list, I'm afraid - but I did take a
look at what you sent.
On 03/06/2012 10:24, Karen Keight wrote:
Hi Jim and Geoff.. I've sent a message to the discussion list and it
is waiting for approval because of the attachments.
Below is the message I've sent .. just in case it does not arrive to
the discussion list..
I think that there's quite a lot of functionality in Jalview that I
could apply to my problem and I've started but I'm not sure if it is
ok so far and how to proceed from now on.
You seem to have done a fair amount of experimentation! I do have a
couple of comments:
I've been arranging my dataset for processing it in Jalview, I attach
it here (testdata.txt) in FASTA format (hopefully)
- Now I have 123 participants, each one is a sequence of 600 time
points.
- At each time point a participant can be in one of 14 states, I
used 14 Protein letters (A,R,N,D,C,E,Q,G,H,L,K,M,F,P) to represent my
states for Jalview to read it. Of course I selected those 14 letters
without following any biological criteria, just picked a letter
for each state.
- For my dataset the P letter represents a NULL state meaning that
there is no time point evaluation for the subject, so it is used
when needed for every subject to complete the 600 time points.
P is, unfortunately, a fairly bad choice. It doesn't matter for
percent identity, but generally, P is a 'special' amino acid, and
rarely mutates. See below..
Is it in correct FASTA format?
If Jalview read it, and the symbols appear in the alignment window in
the way you would expect, then it is correct!
I've tried some trees: average distance looks fine, but neighbour
joining looks strange..
NJ can often look strange if the sequences are not well-related. In
this case, P112 has been chosen as the 'outgroup' - the one that
appears furthest away from all other sequences. The rest of the
individuals seem to fall into two more closely related groups.
What option did you use to calculate each of these? Percentage
identity and BLOSUM62 will give different trees - you'll need to use
the latter if you want distances to account for similar states.
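To make the difference concrete, here is a small Python sketch comparing plain percentage identity with a score that also counts states from the same category as a match. The category table follows the EDUCATION / FAMILY / EMPLOYMENT grouping described in this thread. The grouped score is only a stand-in for a similarity-aware measure: it is not BLOSUM62, and neither function is Jalview's own implementation.

```python
# Category of each state letter, taken from the grouping in this thread.
CATEGORY = {}
for letters, name in (("AN", "EDUCATION"),
                      ("RCEQL", "FAMILY"),
                      ("DGHKM", "EMPLOYMENT"),
                      ("P", "NULL"),
                      ("F", "MIXTURE")):
    for letter in letters:
        CATEGORY[letter] = name


def percent_identity(a, b):
    """Percentage of positions where both sequences have the same letter."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def percent_same_category(a, b):
    """Percentage of positions where both letters fall in the same category."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    matches = sum(1 for x, y in zip(a, b) if CATEGORY[x] == CATEGORY[y])
    return 100.0 * matches / len(a)


if __name__ == "__main__":
    print(percent_identity("AARD", "ANCD"))       # identical letters only
    print(percent_same_category("AARD", "ANCD"))  # related states also count
```

Two participants who move through different but related states look unrelated under the first measure and close under the second, which is why the two options can give different trees.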
I don't know if there's some way to take advantage of the following
information, to give some semantics to the clusters:
- A and N states represent situations in life that are related, for
example both of them refer to EDUCATION
- R, C, E, Q, L states also represent similar life situations, for
example all of them deal with FAMILY
- D, G, H, K, M represent EMPLOYMENT/JOBS
- P state represents NULL
- F represents a mixture of states that have no interpretation for the
moment.
You will need to recode your states to map similar states to similar
amino acids - then you'll be able to take advantage of the intrinsic
amino acid similarities that the conservation and blosum62 measures
employ.
There is a Venn diagram here:
http://www.jalview.org/help/html/misc/aaproperties.html which
indicates the various properties shared by different amino acids. If
you want to have a 'NULL' state, then G is probably the best one to
choose - but you can also use '-' - the gap character. Gaps are
treated specially, and might actually be closest to what you would
consider the 'null's to indicate.
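A recoding of this kind could be scripted along the following lines. This is a minimal Python sketch: the target letters in the table are an illustrative assumption (acidic residues for EDUCATION, hydrophobic ones for FAMILY, basic and small polar ones for EMPLOYMENT, the gap character for NULL) and should be checked against the Venn diagram before use. The mixture state F is left unchanged here, which is also just a placeholder choice.

```python
# Hypothetical recoding table: current state letter -> new letter.
RECODE = {
    # EDUCATION (A, N) -> acidic residues
    "A": "D", "N": "E",
    # FAMILY (R, C, E, Q, L) -> hydrophobic residues
    "R": "I", "C": "L", "E": "V", "Q": "M", "L": "A",
    # EMPLOYMENT (D, G, H, K, M) -> basic / small polar residues
    "D": "K", "G": "R", "H": "H", "K": "S", "M": "T",
    # NULL -> gap character
    "P": "-",
    # Mixture state: left as it is (placeholder choice)
    "F": "F",
}


def recode_sequence(seq):
    """Translate one sequence; an unknown letter raises KeyError."""
    return "".join(RECODE[ch] for ch in seq)


def recode_fasta(text):
    """Recode sequence lines of FASTA text, keeping '>' header lines as-is."""
    out = []
    for line in text.splitlines():
        if line.startswith(">") or not line.strip():
            out.append(line)
        else:
            out.append(recode_sequence(line.strip()))
    return "\n".join(out)


if __name__ == "__main__":
    print(recode_fasta(">P1\nFFANRP\n"))
```

Recoding every letter in a single pass through the table matters here, because old and new letters overlap (for example, the old A becomes D while the old D becomes K), so successive search-and-replace steps would corrupt the data.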
I think that useful information that I should use comes from the
Conservation, Quality and Consensus graphs shown below the sequences
If I understand it well, the Consensus Graph shows for each of my 600
time points the most frequent states and their %
I'm having most Fs at the beginning and Ps thereafter.
Could the Quality measure be useful for my sequences?
If you re-encode your states according to the amino acid groupings,
then you'll certainly get some information from the quality and
conservation measures. Conservation measures the number of common
properties for the amino acids in a column, and quality measures the
average score for the mutations observed in a column - so unlikely
transitions will result in a lower quality score.
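As a rough illustration of what these per-column measures look at, here is a simplified Python sketch. The consensus function reports the most frequent state in a column and its percentage. The conservation function counts how many properties all symbols in the column agree on, either all having the property or all lacking it. The property table uses the state categories from this thread and is hypothetical, and the scoring is deliberately simplified, so it is not the calculation Jalview performs.

```python
from collections import Counter

# Hypothetical property table built from the state categories in this thread.
PROPERTIES = {
    "education": set("AN"),
    "family": set("RCEQL"),
    "employment": set("DGHKM"),
}


def columns(seqs):
    """Turn equal-length sequences into a list of columns (one per time point)."""
    return list(zip(*seqs))


def consensus(column):
    """Most frequent symbol in a column and its percentage."""
    symbol, count = Counter(column).most_common(1)[0]
    return symbol, 100.0 * count / len(column)


def conservation(column):
    """Number of properties on which every symbol in the column agrees."""
    score = 0
    for members in PROPERTIES.values():
        flags = {symbol in members for symbol in column}
        if len(flags) == 1:  # all have the property, or all lack it
            score += 1
    return score


if __name__ == "__main__":
    seqs = ["AR", "NR", "AD"]
    for col in columns(seqs):
        print(col, consensus(col), conservation(col))
```

In the toy data the first column mixes A and N, so its consensus is only two thirds, but every property is still shared. The second column mixes a FAMILY state with an EMPLOYMENT state and therefore scores lower on conservation.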
The Jmol visualization looks really nice!!!!!!!!!!!!!!!!!! Could it
make sense to assign some structure to my sequences and try to
visualize them like that?
Almost certainly not, I'm afraid :)
Would it be useful for finding some kind of regularity or relation?
Excuse the naive question but what do you use Jmol visualization
for?
Jmol is a molecular structure viewer. The sequences Jalview normally
handles are 'shorthand' for biological molecules. Similar sequences -
particularly evolutionarily related ones - have a similar 3D molecular
structure, and can often perform the same kinds of chemical
interactions (because they have similar shapes).
I've been programming in Java and I think that it would be really
nice if I could apply/adapt some of the functionalities to my problem.
OK. You would certainly be able to create a special set of parameters
for your symbols. The matrix encoding the amino acid groupings is
hard-coded into Jalview's source - so it would be straightforward for
you to change the groupings to better reflect the way you are using
the symbols. Take a look at the various matrices in this file:
http://source.jalview.org/gitweb/?p=jalview.git;a=blob_plain;f=src/jalview/schemes/ResidueProperties.java;hb=refs/heads/master
The basic philosophy here is that Jalview maps the letter in each
sequence to an amino acid or nucleotide index, which is then used to look
up entries in the various score matrices and property vectors (the indexes
are given by aaIndex and nucleotideIndex). There are two types of score
models used here, whilst the rest of the file contains hard-coded
colour maps for the built in colour schemes, and look up tables for
converting indexes into text for displaying the name of the amino acid
or nucleotide to the user.
The two types of models are substitution matrices - which reflect the
similarity between two symbols, and property matrices which allow
'conservation' to be calculated, in order to reflect the number of
properties that are different for the symbols in a particular column.
The conservation may be very useful to you - see the paper linked to
in this post for more explanation:
http://www.compbio.dundee.ac.uk/pipermail/jalview-discuss/2012-May/000811.html
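The index-then-lookup idea can be sketched in a few lines of Python. This is a toy illustration, not a copy of ResidueProperties.java: the four-letter alphabet, the score values, the property vectors and all the names are made up, and INDEX simply plays the role described above for aaIndex.

```python
# Toy alphabet of four state symbols (hypothetical).
SYMBOLS = "ANRD"

# Symbol -> index, playing the role described above for aaIndex.
INDEX = {symbol: i for i, symbol in enumerate(SYMBOLS)}

# Substitution matrix: SCORES[i][j] is the similarity of symbols i and j.
# The values are invented; A and N are treated as similar states.
SCORES = [
    [4, 2, -1, -1],
    [2, 4, -1, -1],
    [-1, -1, 4, 0],
    [-1, -1, 0, 4],
]

# Property vectors: one 0/1 entry per symbol index (invented).
PROPERTY_VECTORS = {
    "education": [1, 1, 0, 0],
    "family": [0, 0, 1, 0],
    "employment": [0, 0, 0, 1],
}


def score(a, b):
    """Similarity of two symbols, looked up through their indexes."""
    return SCORES[INDEX[a]][INDEX[b]]


def has_property(symbol, name):
    """Whether a symbol carries the named property."""
    return PROPERTY_VECTORS[name][INDEX[symbol]] == 1


if __name__ == "__main__":
    print(score("A", "N"), score("A", "R"))
    print(has_property("N", "education"), has_property("R", "education"))
```

Changing the groupings then amounts to editing the matrix and the property vectors while keeping the index mapping consistent with both.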
Jim.
_______________________________________________
Jalview-discuss mailing list
[email protected]
http://www.compbio.dundee.ac.uk/mailman/listinfo/jalview-discuss
--
Geoff Barton, Professor of Bioinformatics, College of Life Sciences
University of Dundee, Scotland, UK. [email protected]
Tel:+44 1382 385860/388731 (Fax:385764) www.compbio.dundee.ac.uk
The University of Dundee is registered Scottish charity: No.SC015096