Hi,
As I mentioned in previous posts, I'm using the drosophila DSCAM genes for
testing some tools.
I assembled a fasta file composed of 3 fly entries,
$ cat all_fasta | grep ">"
>AF260530 Drosophila melanogaster Dscam gene, complete cds.
>DQ317106 Drosophila yakuba Dscam gene, exons 3 through 24.
>DQ317109 Drosophila pseudoobscura Dscam gene, exons 3 through 24.
and tried aligning them with clustalw but minutes later still didn't have a
result. I was wondering if
someone could suggest a set of parameters or alternative alignment tool to do a
fast
alignment, even if a bit sloppy. I had always used to slow/accurate approach
and don't
know what options may be available for faster work- these sequences are each
about 50k long.
In the meantime, I was able to get a satisfactory result using exact string
matches using successively
shorter and shorter strings. This approach yields acceptable results in under a
minute and, if needed, you
could segment the questionable areas and feed them to clustal or other tool for
"better" alignment.
It seems to be fast due to only comparing sequences to a reference sequence (
O(n*l^2) but "l" can be smaller
than sequence length as unique features can be found O(l*log(l)) ) . There
are, of course, likely to
be various pathological cases but for sequences known to be similar it seems to
work ok and the indexing
feature allows extraction of substrings with particular distributions (
occuring only once in each sample for example).
I have aligned 2 ecoli strains in perhaps a few minutes and there weren't any
obvious pathological
results ( I obviously didn't check the whole thing either by eye or
programatically).
Others have asked about testing method, I'd like to show how I'm going about
this with the DSCAM example.
The alignment is only one part of more general interest in finding
similar/different features between samples.
These sequences, it turns out, have exon locations in the ncbi entries. So, it
was pretty easy to check the alignments
by examining the locations of the exons in the aligned composite. In this case,
I aligned as follows,
$ string_test -fastas all_fasta -index 8 -length 25 -fix 12 -output 3 -filterN
-filterID -status -fcompare_all> anchor_hits
and could feed the "anchor_hits" locations to an aligner that could start with
these ( actually, it can now start on its own
but that is a developmental issue ) and refine subsections ( or spot gross
transpositions perhaps, would have to check),
$ mm_align_tool -fastas all_fasta -pair_rules anchor_hits -ref 0 -all_samples
-refine marlow -pair_params uniq -pair_align monotonize -doall -dest
algned_fasta -output fasta
The aligned sequences can be output in several formats and the "algned_fasta"
file can be presented along with various
rules or annotations using another tool to create bmp, html, or txt files :
( right now it requires some input parameters, hence the dummy echo)
716 echo | $progpath/annotater -source manno.src> testcomp5.html
464 echo | $progpath/annotater -source bmp.src
585 echo |$progpath/annotater -source text.src>text.txt 2>junk
where the ".src" file just contains command line parameters,
$ cat text.src
-width 120 -font $progpath/4x6-KOI8.pcf
-mrules fixed_exons2
-merge_rules comp_5_hitss
-font /cygdrive/c/mydocs/scripts/cc/affx//4x6-KOI8-R.pcf
-acid_rank /cygdrive/c/mydocs/scripts/cc/affx//nmstrings
-acid_map 20
-xlate -inter -banner
-annotate algned_fasta
So, I could first look at a different alignment metric by outputting a table of
correspondences between input and aligned locations,
602 $progpath/mm_align_tool -fastas all_fasta -pair_rules anchor_hits -ref 0
-all_samples -refine marlow -pair_params uniq -pair_align monotonize -doall
-dest xxxx_raw -output table
and using the table to move "absolute location" rules to their location in the
aligned sequence:
$table_tool -v -table table_table -table_rules ncbi_exons | sed -e
's/exon/exon /g'| sed -e 's/[.:]/ /g' | sort -k 3 -g | more
This generates a cryptic feature comparison map which shows that most of the
exons
end up in the same location on each sequence but see the publication below,
even the differently named exons were aligned from different species in most
cases
( these exon rules are followed by {sequence number, aligned position, offset
from first entry } ):
>exon |1|exon 3 {2,16121,0}{3,16121,0}{4,16157,36}
>exon |1|exon 4 1 {2,18582,0}{3,18582,0}{4,18582,0}
>exon |1|exon 4 10 {2,22532,0}{3,22532,0}{4,22532,0}
>exon |1|exon 4 11 {2,22836,0}{3,22836,0}{4,22845,9}
>exon |1|exon 4 12 {2,23217,0}{3,23217,0}{4,23217,0}
>exon |1|exon 4 2 {2,19006,0}{3,19006,0}{4,19006,0}
>exon |1|exon 4 3 {2,19736,0}{3,19736,0}{4,19736,0}
>exon |1|exon 4 4 {2,20545,0}{3,20545,0}{4,20545,0}
>exon |1|exon 4 5 {2,20872,0}{3,20872,0}{4,20872,0}
>exon |1|exon 4 6 {2,21269,0}{3,21269,0}{4,21269,0}
>exon |1|exon 4 7 {2,21597,0}{3,21597,0}{4,21597,0}
>exon |1|exon 4 8 {2,21895,0}{3,21895,0}{4,21895,0}
>exon |1|exon 4 9 {2,22229,0}{3,22229,0}{4,22212,-17}
>exon |1|exon 5 {2,25020,0}{3,25020,0}{4,25020,0}
>exon |1|exon 6 1 {2,27251,0}{3,27249,-2}{4,27251,0}
>exon |1|exon 6 10 {2,29659,0}{3,29826,167}{4,29451,-208}
>exon |1|exon 6 11 {2,29845,0}{3,30074,229}{4,29640,-205}
>exon |1|exon 6 12 {2,30074,0}{3,30614,540}{4,30114,40}
>exon |1|exon 6 13 {2,30614,0}{3,31054,440}{4,30438,-176}
>exon |1|exon 6 14 {2,30831,0}{3,31255,424}{4,30614,-217}
>exon |1|exon 6 15 {2,31054,0}{3,31475,421}{4,30831,-223}
>exon |1|exon 6 16 {2,31255,0}{3,31684,429}{4,31054,-201}
>exon |1|exon 6 17 {2,31475,0}{3,32161,686}{4,32160,685}
>exon |1|exon 6 18 {2,31688,0}{3,32437,749}{4,32376,688}
>exon |1|exon 6 19 {2,31926,0}{3,33251,1325}{4,32590,664}
>exon |1|exon 6 2 {2,27524,0}{3,27524,0}{4,27497,-27}
...etc...
>exon |1|exon 14 {2,66424,0}{3,66424,0}{4,66424,0}
>exon |1|exon 15 {2,66631,0}{3,66631,0}{4,66631,0}
>exon |1|exon 16 {2,66884,0}{3,66884,0}{4,66884,0}
>exon |1|exon 17 1 {2,70469,0}{3,70469,0}{4,70469,0}
>exon |1|exon 17 2 {2,71004,0}{3,71004,0}{4,71004,0}
>exon |1|exon 18 {2,72995,0}{3,72995,0}{4,72995,0}
>exon |1|exon 19 {2,74100,0}{3,74100,0}{4,74063,-37}
>exon |1|exon 20 {2,74948,0}{3,74948,0}{4,74948,0}
>exon |1|exon 21 {2,75334,0}{3,75334,0}{4,75334,0}
>exon |1|exon 22 {2,75594,0}{3,75594,0}{4,75594,0}
>exon |1|exon 23 {2,76979,0}{3,76979,0}{4,76997,18}
>exon |1|exon 24 {2,78233,0}{3,78233,0}{4,78233,0}
that can be reconciled with known inter-species exon similarities, for example
http://www.pubmedcentral.nih.gov/picrender.fcgi?artid=1431710&blobtype=pdf
It turns out that the exon3 offset for sequence 4 is probably due to a rule
issue, not an alignment issue
( excerpt from the test alignment map including aligner diagnostics such as '{'
in fasta file and
likely translation products in all 3 fwd frames ) as the alignment in this
high-similarity region appears good:
New Section : 16080 to 16200 section 134 of 669
8 9 0 1 2 3 4 5
6 7 8 9
>>AF260530 Drosophila melanogaster Dscam gene, complete cds.
.....................................}AGCTTGTGGTAGTCAGACCCTAGCTGCCAATCCCCCAGATGCCGACCAAAAAGGACCCGTCTTCCTCAAGGAACCCACCAAC
**************************************SALLCVWGV*SVSQRDTPPL*SALCAPQNISPPPPQRDMCAPRDTPQKKKKRGDTPPRVSLFSPLSQKRGENTPPHTPQN
>>AF260530 Drosophila melanogaster Dscam gene, complete cds.
.....................................}AGCTTGTGGTAGTCAGACCCTAGCTGCCAATCCCCCAGATGCCGACCAAAAAGGACCCGTCTTCCTCAAGGAACCCACCAAC
**************************************SALLCVWGV*SVSQRDTPPL*SALCAPQNISPPPPQRDMCAPRDTPQKKKKRGDTPPRVSLFSPLSQKRGENTPPHTPQN
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>exon|1|exon3 XXX
>>DQ317106 Drosophila yakuba Dscam gene, exons 3 through 24.
TATTTACTAATTGGCGGCGTTGTTCTTGTTTCATTTC}AGCTGGTGGTAGTCAGACCCTGGCTGCCAATCCCCCCGATGCCGACCAAAAAGGACCCGTCTTTCTCAAGGAACCCACCAAC
YIFLYTL*NILWGARGARVLCVFSLLCVFFSHIFF***SALWGVWGV*SVSQRDTPPLWGALCAPQNISPPPPPRDMCAPRDTPQKKKKRGDTPPRVSLFFSLSQKRGENTPPHTPQN
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>exon|1|exon3 XXX
>>DQ317109 Drosophila pseudoobscura Dscam gene, exons 3 through 24.
??????????????????????????????????????AGCTTGTGGCAGTCAGACTTTGGCTGCCAATCCACCAGATGCCGACCAGAAGGGACCCGTCTTCCTCAAAGAGCCCACCAAC
**************************************SALLCVWGAQSVSQRDTLFLWGALCAPQNISPHTPQRDMCAPRDTPQREKRGGDTPPRVSLFSPLSQKKRESAPPHTPQN
+++++++++++++++++++++++++++++++++++++++++++
>exon|1|exon3 XXX
I'm aware of the following related alignment literature, open to ideas:
$ string_test -about|unix2dos >/dev/clipboard
Contact: [EMAIL PROTECTED] Nov 2007
Comment: uses some indexing to get speed up,
Comment: motivation for RC rules from this etc ,
Ref:http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1431710
Commment: and should work well on text or (modified slightly ) binary code too
Note: More code in mm_align_tool
Note: Based loosely on references such as these but 'common sense'
Note: seemed to work well as these are after-the-fact lookups
Ref:
http://www.google.com/search?hl=en&safe=off&q=string+alignment+site%3Aciteseer.ist.psu.edu
Ref: http://citeseer.ist.psu.edu/csuros05rapid.html
Comment: Csuros, M., Ma, B.: Rapid homology search with two-stage extension and
Comment: daughter seeds. In: Proc. 11th Int. Computing and Combinatorics Conf.
(COCOON).
Comment: Volume 3595 of LNCS., Springer-Verlag (2005) 104-- 114
Ref: http://citeseer.ist.psu.edu/468459.html
Ref: http://citeseer.ist.psu.edu/kahveci04speeding.html
Feb 2 2008 09:35:40 string_test.h182
Thanks.
Mike Marchywka
586 Saint James Walk
Marietta GA 30067-7165
404-788-1216 (C)<- leave message
989-348-4796 (P)<- emergency only
[EMAIL PROTECTED]
Note: Hotmail is blocking my mom's entire
ISP claiming it is to reduce spam but probably
to force users to use hotmail. Please DON'T
assume I am ignoring you and try
me on [EMAIL PROTECTED] if no reply
here. Thanks.
>
_________________________________________________________________
Need to know the score, the latest news, or you need your HotmailĀ®-get your
"fix".
http://www.msnmobilefix.com/Default.aspx
_______________________________________________
BBB mailing list
[email protected]
http://www.bioinformatics.org/mailman/listinfo/bbb