Specifically, I am a molecular biologist. I have a set of 700+ nucleotide sequences that I want to group into clusters based on sequence similarity. The set covers a wide range of sequences, some of which are homologous to other sequences in the set. I want to use clustering to identify these groups.
If the sequences were related and could be trimmed to the same length, I would do an alignment and then use PHYLIP (or some other distance method) to create a distance matrix. But since my sequences are unrelated and cannot be trimmed to the same length, I am at a loss for what to do.
For a set with so many unrelated sequences of different lengths, the only thing I have been able to do is run an all-against-all BLAST to create the matrix, but this gives high scores for similar pairs, whereas a distance matrix needs high values for dissimilar pairs. My only thought was to use the reciprocal of the BLAST score as some perverse measure of distance.
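
To make the score-to-distance idea concrete, here is a minimal sketch, written in Python purely for illustration. It does not use the plain reciprocal. Instead it assumes a different conversion, d(i,j) = 1 - s(i,j) / min(s(i,i), s(j,j)), which scales each pairwise score by the smaller of the two self-hit scores so that distances fall between 0 and 1 and pairs with no BLAST hit get distance 1. That formula, the function names, the single-linkage grouping, the cutoff of 0.5, and the toy scores are all assumptions made for the example, not BLAST output or an established recipe.

```python
def scores_to_distances(scores):
    """Convert a square similarity-score matrix into distances in [0, 1].

    Assumed conversion: d[i][j] = 1 - s[i][j] / min(s[i][i], s[j][j]).
    Pairs with no hit (score 0) end up at distance 1.
    """
    n = len(scores)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Scores for (i, j) and (j, i) can differ, so take the larger one.
            s = max(scores[i][j], scores[j][i])
            self_score = min(scores[i][i], scores[j][j])
            ratio = s / self_score if self_score > 0 else 0.0
            dist[i][j] = 1.0 - min(ratio, 1.0)
    return dist


def single_linkage_clusters(dist, cutoff):
    """Group items linked by chains of pairwise distances <= cutoff."""
    n = len(dist)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= cutoff:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy scores for four sequences (made-up numbers, not real BLAST output).
toy_scores = [
    [200.0, 150.0, 0.0, 0.0],
    [150.0, 180.0, 0.0, 0.0],
    [0.0, 0.0, 300.0, 240.0],
    [0.0, 0.0, 240.0, 260.0],
]
toy_dist = scores_to_distances(toy_scores)
toy_clusters = single_linkage_clusters(toy_dist, cutoff=0.5)
print(toy_clusters)
```

The resulting distance matrix could equally be handed to any standard hierarchical clustering routine. The single-linkage grouping here is only the simplest thing that shows the matrix being used.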
I am not subscribed to the list, so can I ask for responses directly to my email address?
Thank you, Tom Isenbarger
-- [EMAIL PROTECTED] thomas a isenbarger (608) 265-0850
______________________________________________ [EMAIL PROTECTED] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html