Thank you, David. Since I emailed the mailing list, I may have found a solution independently: please see https://www.biostars.org/p/361700/#362231.
But if I find your solution to be better in some way(s), I'll use it instead. Thank you for taking the time to explain your solution in detail, I truly appreciate it. Sincerely, Anand On Tue, Feb 5, 2019 at 7:01 AM <[email protected]> wrote: > Send EMBOSS mailing list submissions to > [email protected] > > To subscribe or unsubscribe via the World Wide Web, visit > http://mailman.open-bio.org/mailman/listinfo/emboss > or, via email, send a message with subject or body 'help' to > [email protected] > > You can reach the person managing the list at > [email protected] > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of EMBOSS digest..." > > > Today's Topics: > > 1. Re: run parameters for inexact matching with EMBOSS needle > (David Mathog) > 2. Re: run parameters for inexact matching with EMBOSS needle > (David Mathog) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Mon, 04 Feb 2019 09:57:06 -0800 > From: David Mathog <[email protected]> > To: [email protected] > Subject: Re: [EMBOSS] run parameters for inexact matching with EMBOSS > needle > Message-ID: <[email protected]> > Content-Type: text/plain; charset=UTF-8; format=flowed > > On 03-Feb-2019 04:18, Anandkumar Surendrarao wrote: > > > - I want to search with FASTA formatted DNA sequence queries that > > are <= > > 100nt in length, > > - against FASTA formatted genomic DNA sequence, > > - for global end-to-end matches (using Needleman Wunsch type global > > search, not Smith Waterman type local matching) > > - but allowing >= 80% identity and >= 80% of query length coverage > > Not sure about needle, except that I expect it would be very slow. > > However, I made a modified version of Lastal that added command line > parameters for identity/query/target cutoff for almost exactly the same > application. It is available here in the "modified programs" directory. > > https://sourceforge.net/projects/lemonade-assemble/ > > There is a Centos 6 64 bit binary there, for any other platform you will > need to recompile. Lastal does not have NW exactly, but it does have a > -T parameter which does something which in many cases is similar. -T 0 > is a local alignment, and -T 1 "extends to the end" (starting from the > local alignment). My use case involved Illumina reads, and what would > happen is if there was a mismatch at 3 or 4 bp from the end of that read > in the alignment against the target (in this case mRNA transcripts) -T 0 > would truncate the alignment at the position before the mismatch, so the > read would not be seen as a full length match. With -T 1 it would pick > up the extra matching bases. Anyway, let's say you have > > reads.fasta # bunch of 100bp reads > genome.fasta # self explanatory > > CPUS=40 > lastdb -P $CPUS -w 10 genome_10 genome.fasta > lastal -P $CPUS -T1 -I80 -Y80 -m250 -O50 -8 1 -9 1 \ > -f BlastTab genome_10 reads.fasta > results.txt 2>errors.log > > > The lastdb command makes a database but uses only every 10th position. > This will make the database smaller and the search much faster. These > positions are only used as seeds, so if the expected match is very good > -w10 gives the same results as -w1 (default). If the match is expected > to be poor use -w 1 and expect much longer run times. Set CPUS to match > your system. > > The lastal command looks for 80% identity (-I80), 80% coverage on query > (-Y80), allows up to 250 overlapping matches (-m250, see below), allows > offset overlap alignment as > long as the overlap is at least 50bp (-O50), requires overlap alignments > to go right > to the ends (-8 1 -9 1), and -f BlastTab gives a blast tabular format > output file (except when it is on the other strand the query positions > are flipped rather than the target). Offset overlap is like this: > > ------------------------- > ||||||||||||||| <- -O50 == this must be >= 50bp > ------------------------------ > > This only concerns alignments which overlap the ends. The -m parameter > is a standard lastal command line option: > > -m: maximum initial matches per query position (10) > > What it means is that if there are a lot of alignments in exactly the > same place by default one only retrieves a few of them. Increasing the > value to 250 means many more will be retrieved. Conceivably if there is > super high coverage you might want to use an even larger number. Best > to first try a small subset to determine the appropriate -m value for > your case. > > I don't know if it would be faster if the query/database were the other > way around. > > Regards, > > David Mathog > [email protected] > Manager, Sequence Analysis Facility, Biology Division, Caltech > > > ------------------------------ > > Message: 2 > Date: Mon, 04 Feb 2019 10:32:20 -0800 > From: David Mathog <[email protected]> > To: [email protected] > Subject: Re: [EMBOSS] run parameters for inexact matching with EMBOSS > needle > Message-ID: <[email protected]> > Content-Type: text/plain; charset=UTF-8; format=flowed > > On 04-Feb-2019 09:57, David Mathog wrote: > > lastal -P $CPUS -T1 -I80 -Y80 -m250 -O50 -8 1 -9 1 \ > > -f BlastTab genome_10 reads.fasta > results.txt 2>errors.log > > I should add that for my case it was -I93 -J100. The reads were only > 76bp and > it was common to have a few bp mismatches on the ends, hence 93% > identity. -J instead > of -Y because it was transcripts searching a reads database rather than > the other way around. 80% is pretty low if the short sequences should > be present 1:1 in the assembled genome, even for a highly polymorphic > genome where only one haplotype might be present. > > Regards, > > David Mathog > [email protected] > Manager, Sequence Analysis Facility, Biology Division, Caltech > > > ------------------------------ > > Subject: Digest Footer > > _______________________________________________ > EMBOSS mailing list > [email protected] > http://mailman.open-bio.org/mailman/listinfo/emboss > > ------------------------------ > > End of EMBOSS Digest, Vol 121, Issue 2 > ************************************** >
_______________________________________________ EMBOSS mailing list [email protected] http://mailman.open-bio.org/mailman/listinfo/emboss
