Hi Andreas
Thanks for having a look. I found the issue: if your sequence is lower
case, the alignment is different. Try aligning with query.toLowerCase()
and target.toLowerCase(), then the output is:
Number of identical residues: 334
% identical query: 0.23405746
% identical query: 0.22845417
vs upper case
Number of identical residues: 1394
% identical query: 0.9768746
% identical query: 0.95348835
So I just inserted a toUpperCase() and it works now.
Regards
Wim
On 20-04-11 19:33, Andreas Prlic wrote:
Hi Wim,
are you sure you are using the correct sequences in your test? When I
run the code at the bottom of this emails I am getting 95 and 97%
sequence ID, which is similar to what you are expecting.
Andreas
Here my code: (using latest code from SVN)
package demo;
import org.biojava3.alignment.Alignments;
import org.biojava3.alignment.Alignments.PairwiseSequenceAlignerType;
import org.biojava3.alignment.SimpleGapPenalty;
import org.biojava3.alignment.SubstitutionMatrixHelper;
import org.biojava3.alignment.template.GapPenalty;
import org.biojava3.alignment.template.PairwiseSequenceAligner;
import org.biojava3.alignment.template.SequencePair;
import org.biojava3.core.sequence.DNASequence;
import org.biojava3.core.sequence.compound.AmbiguityDNACompoundSet;
import org.biojava3.core.sequence.compound.NucleotideCompound;
public class TestDNANeedlemanWunsch {
public static void main(String[] args){
String query =
"AGGATGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGGTGAAGCCCAGCTTGCTGGGTGGATCA"
+
"GTGGCGAACGGGTGAGTAACACGTGAGCAACCTGCCCCTGACTCTGGGATAAGCGCTGGAAACGGTGTCT" +
"AATACTGGATATGAGCTACCACCGCATGGTGAGTGGTTGGAAAGATTTTTCGGTTGGGGATGGGCTCGCG" +
"GCCTATCAGCTTGTTGGTGAGGTAATGGCTCACCAAGGCGTCGACGGGTAGCCGGCCTGAGAGGGTGACC" +
"GGCCACACTGGGACTGAGACACGGCCCAGACTCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGC" +
"GGAAGCCTGATGCAGCAACGCCGCGTGAGGGACGACGGCTTCGGGTTGTAAACCTCTTTTAGCAGGGAAG" +
"AAGCGAGAGTGACGGTACCTGCAGAAAAAGCGCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAG" +
"GGCGCAAGCGTTATCCGGAATTATTGGGCGTAAAGAGCTCGTAGGCGGTTTGTCGCGTCTGCTGTGAAAA" +
"CCCGAGGCTCAACCTNNGGGCTGCAGTGGGTACGGGCAGACTAGAGTGCGGTAGGGGAGATTGGAATTCC" +
"TGGTGTAGCGGTGGAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGATCTCTGGGCCGTAAC" +
"TGACGCTGAGGAGCGAAAGGGTGGGGAGCAAACAGGCTTAGATACCCTGGTAGTCCACCCCGTAAACGTT" +
"GGGAACTAGTTGTGGGGTCCTTTCCACGGATTCCGTGACGCACGTAACGCATTAAGTTCCCCGCCTGGGG" +
"AGTACGGCCGCAAGGCTAAAACTCAAAGGAATTGACGGGGACCCGCACAAGCGGCGGAGCATGCGGATTA" +
"AATCGATGCAACGCGAAGAACCTTACCAAGGCTTGACATACACGAGAACGCTGCAGAAATGTAGAACTCT" +
"TTGGACACTCGTGAACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCG" +
"CAACGAGCGCAACCCTCGTTCTATGTTGCCAGCACGTAATGGTGGGAACTCATGGGATACTGCCGGGGTC" +
"AACTCGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGTCTTGGGCTTCACGCATGCTACA" +
"ATGGCCGGTACAAAGGGCTGCAATACCGTGAGGTGGAGCGAATCCCAAAAAGCCGGTCCCAGTTCGGATT" +
"GAGGTCTGCAACTCGACCTCATGAAGTCGGAGTCGCTAGTAATCGCAGATCAGCAACGCTGCGGTGAATA" +
"CGTTCCCGGGTCTTGTACACACCGCCCGTCAAGTCATGAAAGTCGGTAACACCTGAAGCCGGTGGCCTAA" +
"CCCTTGTGGAGGGAGCCGGTAATTAAA";
String target =
"CTGGCCGCCTGCTTAACACATCCAAGTCGAACGGTGAAGCCCCANCTTACTGGGTGGATCAGTGCCGAAC"
+
"GGGTGAGTAACACGTGAGCAACCTCCCCCTGACTCTGGGATAAGCGCTGGAANCGGTGTCTAATACTGGA" +
"TATGAGCTACCACCGCATGGTGAGTGGTTGGAAAGATTTTTCGGTTGGGGATGGGCTCGCGCCCTATGAG" +
"CTTGTTGGTGAGGTAATGGCTCACCAAGCCGTCGACGGGTAGCCGGCCTGAGAGGGTGACCGNCCACACT" +
"GGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGGAAGCCT" +
"GATTCANCAACCCCGCGTGAGGGACGACGGCCTTCGGGTTGTAAACCTCTTTTAGCAGGGAAGAAGCGAG" +
"AGTGACGGTACCTGCAGAAAAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGCGCAA" +
"GCGTTATCCGGAATTATTGGGCGTAAAGAGCTCGTAGGCGGTTTGTCGCGTCTGCTGTGAAAACCCGAGG" +
"CTCAACCTCGGGCCTGCAGTGGGTACGGGCAGACTAGAGTGCGGTAGGGGAGATTGGAATTCCTGGTGTA" +
"GCGGTGGAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGATCTCTGGGCCGTAACTGACGCT" +
"GAGGAGCGAAAGGGTGGGGAGCAAACAGGCTTAGATACCCTGGTAGTCCACCCCGTAAACGTTGGGAACT" +
"AGTTGTGGGGTCCTTTCCACGGATTCCGTGACGCAGCTAACGCATTAAGTTCCCCGCCTGGGGAGTACGG" +
"CCGCAAGGCTAAAACTCAAAGGAATTGACGGGGACCCGCACAAGCGGCGGAGCATGCGGATTAATTCGAT" +
"GCAACGCGAAGAACCTTACCAAGGCTTGACATACACGAGAACGCTGCAGAAATGTAGAACTCTTTGGACA" +
"CTCGTGAACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAG" +
"CGCAACCCTCGTTCTATGTTGCCAGCACGTAATGGTGGGAACTCATGGGATACTGCCGGGGTCAACTCGG" +
"AGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGTCTTGGGCTTCACGCATGCTACAATGGCCG" +
"GTACAAAGGGCTGCAATACCGTGAGGTGGAGCGAATCCCAAAAAGCCGGTCCCAGTTCGGATTGAGGTCT" +
"GCAACTCGACCTCATGAAGTCGGAGTCGCTAGTAATCGCAGATCAGCAACGCTGCGGTGAATACGTTCCC" +
"GGGTCTTGTACACACCGCCCGTCAAGTCATGAAAGTCGGTAACACCTGAAGCCGGTGGCCCAACCCTTGT" +
"GGAGGGAGCCGTCGAAGGTGGGATCGGTAATTAGGACTAAGTCGTAACAAGGTAGCCGTACC";
GapPenalty penalty = new SimpleGapPenalty((short)-14,
(short)-4);
PairwiseSequenceAligner<DNASequence, NucleotideCompound>
aligner =
Alignments.getPairwiseAligner(
new DNASequence(query,
AmbiguityDNACompoundSet.getDNACompoundSet()),
new DNASequence(target,
AmbiguityDNACompoundSet.getDNACompoundSet()),
PairwiseSequenceAlignerType.GLOBAL,
penalty, SubstitutionMatrixHelper.getNuc4_4());
SequencePair<DNASequence, NucleotideCompound>
alignment = aligner.getPair();
System.out.println(alignment);
int identical = alignment.getNumIdenticals();
System.out.println("Number of identical residues: " +
identical);
System.out.println("% identical query: " + identical / (float)
query.length() );
System.out.println("% identical query: " + identical / (float)
target.length() );
}
}
On Mon, Apr 18, 2011 at 8:22 AM, Wim De Smet<[email protected]> wrote:
Hi all,
I've been trying to generate some global alignments with biojava and
comparing them with what needle returns. Doing this, I can't seem to
reproduce needle's alignment with biojava. The score returned from biojava
seems to be worse than that from needle, so I'm not sure what's happening
here.
The sequences are AB004720 and Y17238 (I didn't attach a fasta file to avoid
spamming people, let me know if you want one). I align them with:
GapPenalty penalty = new SimpleGapPenalty((short)-14, (short)-4);
PairwiseSequenceAligner<DNASequence, NucleotideCompound> aligner =
Alignments.getPairwiseAligner(
new DNASequence(query, AmbiguityDNACompoundSet.getDNACompoundSet()),
new DNASequence(target, AmbiguityDNACompoundSet.getDNACompoundSet()),
PairwiseSequenceAlignerType.GLOBAL,
penalty, SubstitutionMatrixHelper.getNuc4_4());
SequencePair<DNASequence, NucleotideCompound>
alignment = aligner.getPair();
This gives me an alignment with only 23% similarity and a gap at the end.
Varying the gap penalties can give me a gap in front too, but that's about
it. When aligning in needle, I get a sequence with a higher score (6784 vs
(-)5862) and 94% similarity (which seems closer to home). Needle I just run
with defaults (so it uses EDNAFULL) and a go/ge of 14/4.
Could this be a bug or am I misunderstanding some of the options?
BTW, if I use a really large gapextend, say -4000, I also get a nullpointer
exception.
TIA,
Wim De Smet
--
Wim De Smet
http://www.straininfo.net/
_______________________________________________
Biojava-l mailing list - [email protected]
http://lists.open-bio.org/mailman/listinfo/biojava-l
--
Wim De Smet
http://www.straininfo.net/
_______________________________________________
Biojava-l mailing list - [email protected]
http://lists.open-bio.org/mailman/listinfo/biojava-l