Hi Karen,

If I understand you correctly, you want 'N' bases to be totally "invisible" during the generation and scoring of the alignment.
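To check that we mean the same thing, here is a rough sketch in Python (not EMBOSS code) of the counting I think you are after, using the example from your message. It assumes the two sequences are already aligned to the same length with '-' for gaps, and the function name `identity_excluding_n` is made up for illustration. Columns where either sequence has an N are dropped from both the match count and the length; gap columns stay in the length as non-matches, which is my assumption rather than something you specified.

```python
def identity_excluding_n(aligned_a, aligned_b, ignore="Nn"):
    """Return (matches, compared) over columns where neither base is N."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a in ignore or b in ignore:
            continue  # non-informative column: skip it entirely
        compared += 1
        # Assumption: gap columns count towards the length but never match.
        if a != "-" and a.upper() == b.upper():
            matches += 1
    return matches, compared


matches, compared = identity_excluding_n("CATTCNNNCA", "CATTCAAACA")
print(f"{matches}/{compared}")  # 7/7
```

The same function gives 85/90 for your 100-base case with 10 Ns and 5 real mismatches.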
To score the alignment in the way you describe would, I think, require some (probably trivial) reprogramming of needle, exposed via a new "advanced" option. It could be done for the next release or sooner; how urgently do you need it?

Scoring in the way you describe is a reasonable thing to do, but if N is not to contribute to the score it should not contribute to the alignment either, so you'd need to adjust the scoring matrix so that all substitutions involving N are neutral, I guess by specifying a value of zero for them.

Just my two penn'orth :)

Cheers
Jon

> I have an additional question about needle, as I would like to
> actually remove noninformative bases from the final alignment score:
>
> i.e. if the sequence follows
> -CATTCNNNCA-
> -CATTCAAACA-
>
> With the suggested matrix weight changes I would expect to see a 100%
> similarity of 10/10 bases.
> However, it is more informative to me to see 100% similarity of 7/7
> bases (with N no longer aiding my alignment score). One could imagine
> an artificial similarity score inflation if the entire length is used
> to generate the score, i.e. if 100 bases were being aligned to a 100 bp
> sequence (containing 10 "Ns"), and then 5 of those bases were an
> informative mismatch:
>
> Needle would currently provide:
> 95/100 (or simply 95% similarity)
>
> But the answer needed would be:
> 85/90 (or 94.4% similarity).
>
> Does this make sense?
> Thank you in advance for any help you can offer!
>
> Karen
>
> On 2/8/07, Karen Hayden <[EMAIL PROTECTED]> wrote:
>> Hey Peter,
>> That was absolutely perfect. Thank you!
>>
>> Best wishes,
>> Karen
>>
>> On 2/8/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>> > Dear Karen,
>> >
>> > > I am currently using needle to generate an alignment between two
>> > > sequences which contain non-informative bases (i.e. bases identified
>> > > as low quality by their phred scores and changed to "N").
>> > > Presently, these bases are penalized as any other non-matching
>> > > character.
>> > > Is there any way to change needle to "overlook" these
>> > > bases when generating the best scoring alignment (or do I need to
>> > > write my own version of needle)?
>> >
>> > There are two matrix files for nucleotide comparisons. The default is
>> > EDNAFULL which counts N as an average of all possible scores (1 match
>> > against 3 possible mismatches).
>> >
>> > The alternative is EDNAMAT which only scores exact matches like blastn
>> > (use -data EDNAMAT on the command line to see the difference).
>> >
>> > But you can also copy EDNAFULL to your local directory with
>> >
>> > embossdata EDNAFULL -fetch
>> > mv EDNAFULL EDNAPHRED
>> > (best to do this rename or you will accidentally be using this file by
>> > default for other needle runs in the same directory)
>> >
>> > Edit EDNAPHRED to have the scores you want for N (perhaps +1 for a small
>> > match to ACGTU, +2 for a match to a 2-base code RYSWKM, +3 for a match to
>> > a 3-base code BDHV and +4 for a match to another N).
>> >
>> > Then run with:
>> >
>> > needle -data EDNAPHRED
>> >
>> > If enough users think this is a meaningful scoring system we could add
>> > such a matrix to the distribution. Let us know if it really gives you more
>> > useful scores. My natural prejudice is to trust EDNAFULL. I guess you are
>> > expecting to often find the base in the other sequence is the one phred
>> > started with, which will indeed bias the scoring.
>> >
>> > Hope this helps,
>> >
>> > Peter
>> >
>>
>> --
>> Karen E. Hayden
>> Starving Graduate Student
>> Duke University
>> Durham, NC 27708
>
> --
> Karen E. Hayden
> Starving Graduate Student
> Duke University
> Durham, NC 27708
> _______________________________________________
> EMBOSS mailing list
> [email protected]
> http://lists.open-bio.org/mailman/listinfo/emboss
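For the matrix route in the quoted instructions (fetch the matrix, rename it, edit the N scores), the edit itself can be scripted. Below is a minimal sketch in Python, assuming the matrix file uses the usual layout: '#' comment lines, one header row of column symbols, then one row per symbol starting with that symbol. The function name `neutralise_symbol` is hypothetical, and the toy matrix is not the real EDNAFULL data. It applies the neutral (zero) scoring for N suggested at the top of this message, not the graded +1 to +4 scheme.

```python
def neutralise_symbol(matrix_text, symbol="N", value=0):
    """Set every score in the row and column of `symbol` to `value`."""
    out = []
    header = None
    col = None
    for line in matrix_text.splitlines():
        fields = line.split()
        if not fields or line.lstrip().startswith("#"):
            out.append(line)  # keep comments and blank lines untouched
            continue
        if header is None:
            header = fields  # first data line lists the column symbols
            col = header.index(symbol)
            out.append(line)
            continue
        row_symbol, scores = fields[0], fields[1:]
        if row_symbol == symbol:
            scores = [str(value)] * len(scores)  # whole N row
        else:
            scores[col] = str(value)  # N column in every other row
        out.append(row_symbol + "".join(f"{s:>4}" for s in scores))
    return "\n".join(out) + "\n"


# Toy three-symbol matrix for illustration only (assumed layout, made-up values).
sample = """# toy matrix, not the real EDNAFULL values
    A   C   N
A   5  -4  -2
C  -4   5  -2
N  -2  -2  -1
"""
print(neutralise_symbol(sample))
```

The returned text could be written back to the renamed copy (EDNAPHRED in the quoted message) and used with `needle -data EDNAPHRED`. Note that this only makes N neutral in the alignment score; it does not change how the reported similarity percentage is counted.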
