[replying to your personal address as well as the list; but I think you should subscribe to the list since this topic may well be pursued further]
On 08-Dec-04 Dr. Thomas Isenbarger wrote:
> I have a matrix of similarity scores that I want to convert into a
> matrix of dissimilarity scores so that I can apply some clustering
> methods to the data. That is, high values in my matrix signify
> similarity and low values (zero being the lowest) signify no
> similarity. What functions/options in R or its packages are available
> for making this kind of transformation of a matrix?
>
> Specifically, I am a molecular biologist. I have a set of 700+
> nucleotide sequences I want to group into clusters based on sequence
> similarities. There is a wide range of sequences in the set, some of
> which are homologous to other sequences in the set. I want to use
> clustering to identify these groups.
>
> If the sequences were related and could be trimmed to the same length, I
> would do an alignment and then use phylip (or some other distance
> method) to create a distance matrix, but since my sequences are
> unrelated and cannot be trimmed to the same length, I am at a loss for
> what to do.
>
> For a set with so many unrelated sequences of different lengths, the
> only thing I have been able to do is an all-against-all BLAST to create
> the matrix, but this gives high scores for similarities, not high
> scores for dissimilarities. The only thought I had was to use the
> reciprocal of the BLAST score as some perverse measure of distance.
>
> I am not subscribed to the list, so can I ask for responses directly to
> my email address?

Clearly any function which "inverts" the measure of "similarity"
(i.e. decreases as "similarity" increases) could be used as a
measure of dissimilarity in general. Indeed you imply as much
yourself. There is quite a wide choice ... "reciprocal" could
be one.

However, reading between your lines, it seems that you do not
have a substantive interpretation for "dissimilarity". Yet
apparently you have one for "similarity".
Otherwise, on what basis do you claim that your similarity matrix
expresses *substantive* similarity?

But, if you can attach an interpretation (in some substantive
terms) to your measure of similarity, can you not then negate
the propositions that this expresses and obtain a measure of
dissimilarity? In that case, the function could be programmed
in R (though it may not be a function of your "similarity", and
you would need to derive it from the data). If not, why not?

Or, if your measure of "similarity" in fact does not carry a
substantive interpretation, then one could assert that any
decreasing function of "similarity" could be used, and would
be as meaningful as your measure of "similarity". Again, this
can be programmed in R.

Again reading between your lines, it could be inferred that
in the situation you describe ("unrelated sequences" which
"cannot be trimmed to the same length"), while you can derive
a measure of similarity which matches established concepts
for similarity in your field, you cannot match the concepts
for dissimilarity. If that is the case, R cannot help you with
the conceptual problem.

This may not appear helpful, but it is a sincere attempt to
clarify the issues.

Best wishes,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861  [NB: New number!]
Date: 08-Dec-04                                       Time: 23:10:55
------------------------------ XFMail ------------------------------

______________________________________________
[EMAIL PROTECTED] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
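
As an illustration of the "any decreasing function" option discussed
in the reply above, here is a minimal sketch in base R. It assumes a
small made-up similarity matrix called `sim` (the scores and the names
seq1..seq4 are hypothetical, not real BLAST output). Averaging the two
directions and subtracting from the maximum are arbitrary choices made
only for the sake of the example; neither carries any substantive
interpretation for sequence data.

```r
# Hypothetical 4 x 4 similarity matrix: higher = more similar, 0 = none.
# Deliberately slightly asymmetric, as all-against-all scores may be.
ids <- paste0("seq", 1:4)
sim <- matrix(c(100, 82,  0,  5,
                 78, 90, 10,  0,
                  0, 10, 70, 60,
                  5,  0, 60, 95),
              nrow = 4, byrow = TRUE,
              dimnames = list(ids, ids))

# as.dist() reads only the lower triangle, so make the matrix symmetric
# first. Averaging the two directions is an assumption, not a rule.
sim <- (sim + t(sim)) / 2

# One decreasing transform among many: subtract from the largest score.
# (The reciprocal 1 / sim would give Inf wherever the similarity is 0,
# which hclust() does not accept.)
dissim <- max(sim) - sim

# as.dist() ignores the diagonal, so the non-zero self-dissimilarities
# left by this transform do not affect the clustering.
d <- as.dist(dissim)

# Hierarchical clustering, then cut the tree into two groups.
hc <- hclust(d, method = "average")
groups <- cutree(hc, k = 2)
print(groups)
```

Whether the resulting groups mean anything still depends on the
conceptual question raised in the reply. Note also that with single or
complete linkage the merge order depends only on the rank order of the
dissimilarities, so the particular decreasing transform matters less
there than it does with average linkage.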