> After trimming a bunch of illumina reads I have sequences that are 12bp long and "we" want to find > their genomic location
location(S!) > > I have both a perl regex and fuzznuc running now for oh about a week.......... It shouldn't take that long. Assuming the words are perfectly known (so identity match only), and you have more than one of these words to do, the fastest way would be to break the genome(s) down into words, sort that form, and then just pull out matching word positions from the resulting table(s). If on the other hand, you are allowing mismatches, well, ugh. 12 is pretty small to start with... > > I have also attempted to build/configure/run without success > maq > soap > blat Here is a tool which is very simple to get you started: http://saf.bio.caltech.edu/pub/software/molbio/fastatuples.c Do fastatuples -h to see the options. Run it like (for instance) fastatuples -n -w 12 -d -p -s <dna.nfa 1: 0 GAATTCTGCTCC 1: 1 AATTCTGCTCCA 1: 2 ATTCTGCTCCAG 1: 3 ACTGGAGCAGAA 1: 4 AACTGGAGCAGA 1: 5 CAACTGGAGCAG 1: 6 GCAACTGGAGCA 1: 7 GCTCCAGTTGCC 1: 8 CTCCAGTTGCCA etc. Where the first column is the sequence entry, the second the offset, and the third the sequence. (With -d it emits "degenerate sequence", that is, it compares the forward and reverse sequences and puts out the one with the smaller collating sequence.) The output will be of size 31 X length of genome. You can make it smaller by using the -b (binary output) option. When that is done sort it using: the 3rd column as the primary key, the 1st column as the secondary key, and the 2nd column as the tertiary key. (Trickier with binary output.) At that point if desired you can trivially reduce it with a perl script to emit something like: CTCCAGTTGCCA (1:8,30,501...),(2:1000,...) etc. Then just look up the matching rows for each word. Or leave it unpacked, and use a binary search in the sorted file to find the words you want. Regards, David Mathog [email protected] Manager, Sequence Analysis Facility, Biology Division, Caltech _______________________________________________ EMBOSS mailing list [email protected] http://lists.open-bio.org/mailman/listinfo/emboss
