On 10/05/10 15:06, chaoticcran...@gmail.com wrote:
On Oct 5, 3:38 pm, MRAB<pyt...@mrabarnett.plus.com>  wrote:
On 05/10/2010 20:03, chaoticcran...@gmail.com wrote:



So, I have a rather tricky string comparison problem: I want to search
for a set pattern in a variable source.

To give you the context, I am searching for set primer sequences
within a variable gene sequence. In addition to the non-degenerate A/G/
C/T, the gene sequence could have degenerate bases that could encode
for more than one base (for example, R means A or G, N means A or G or
C or T). One brute force way to do it would be to generate every
single non-degenerate sequence the degenerate sequence could mean and
do my comparison with all of those, but that would of course be very
space and time inefficient.
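
To make the cost of the brute-force idea concrete, here is a minimal sketch of the expansion. The `expansions` table covers only the plain bases plus the two degenerate codes named above (R and N), and the `expand` helper and sample sequence are made up for illustration. The number of variants is the product of the choices at each position, so it grows quickly with every degenerate base.

```python
from itertools import product

# Partial table: plain bases plus the two degenerate codes named in the post.
expansions = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "N": "ACGT"}

def expand(seq):
    """Return every non-degenerate sequence that seq could stand for."""
    return ["".join(p) for p in product(*(expansions[c] for c in seq))]

# Hypothetical four-base fragment with two degenerate positions.
variants = expand("ARNT")
print(len(variants))  # 1 * 2 * 4 * 1 combinations
```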

For the sake of simplicity, let's say I replace each degenerate base
with a single wildcard character "?". We can do this because there are
so many more non-degenerate bases that the probability of a degenerate
mismatch is low if the nondegenerates in a primer match up.

So, my goal is to search for a small, set pattern (the primer) inside
a large source with single wildcard characters (my degenerate gene).

The first thing that comes to my mind is regular expressions, but I'm
rather n00bish when it comes to using them and I've only been able to
find help online where the smaller search pattern has wildcards and
the source is constant, such as here:
http://www.velocityreviews.com/forums/t337057-efficient-string-lookup...

Of course, that's the reverse of my situation and the proposed
solutions there won't work for me. So, could you help me out, oh great
Python masters? *bows*

Stand back, I'm going to try regex. :-)

Both "A" and "R" in the variable sequence should match "A" in the
primer sequence (as should "N", which can stand for any base), so "A"
in the primer sequence should be replaced by the character set "[ARN]".
The other bases should be replaced similarly.

Use a simple dict lookup:

wildcards = {"A": "[ARN]", "G": "[GRN]", "C": "[CN]", "T": "[TN]"}

and create the regex for the primer sequence:

import re
primer_pattern = re.compile("".join(wildcards[c] for c in primer))

Would that work?
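
Put together, the idea above might look like the sketch below. The `wildcards` dict is the one from this message and only knows about R and N. The `primer` and `gene` strings are invented sample data, not real sequences.

```python
import re

# Each primer base maps to the set of gene symbols that could stand for it.
# Only the degenerate codes R and N are handled here.
wildcards = {"A": "[ARN]", "G": "[GRN]", "C": "[CN]", "T": "[TN]"}

# Hypothetical sample data for illustration.
primer = "GATC"
gene = "TTGRNCAA"

# Build one character set per primer base and join them into a pattern.
primer_pattern = re.compile("".join(wildcards[c] for c in primer))
print(primer_pattern.pattern)

# search() scans the whole gene; the result is None if nothing matches.
m = primer_pattern.search(gene)
if m is not None:
    print(m.group())
```

A primer containing a character outside A/G/C/T would raise a KeyError in this sketch, which may or may not be what you want.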


Thank you for your response, MRAB.

That's a rather clever way to do this sort of matching, but I actually
forgot one other crucial thing in my problem description (and I'm
hitting myself on the head for forgetting it!) - I need to know at
what position in my gene the primer was found.

If you use the primer_pattern.search() method (which searches starting at all offsets) instead of .match() (which only searches from the beginning), it should return a match object that has a .start() method to let you know the offset:

  m = primer_pattern.search(my_data)
  if m is None:
    print("Not found")
  else:
    print("Found at %i" % m.start())

-tkc
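
If the primer could occur at more than one position, .search() only reports the first. One possible way to collect every offset is sketched below. It wraps the pattern in a lookahead so that overlapping sites are reported too, which is an assumption about what is wanted here. The `find_primer_sites` helper and the sample strings are hypothetical.

```python
import re

# Same mapping as earlier in the thread (R and N only).
wildcards = {"A": "[ARN]", "G": "[GRN]", "C": "[CN]", "T": "[TN]"}

def find_primer_sites(primer, gene):
    """Return the 0-based start offset of every possible primer site."""
    body = "".join(wildcards[c] for c in primer)
    # The lookahead makes each match zero-width, so finditer() can
    # report sites that overlap one another.
    pattern = re.compile("(?=%s)" % body)
    return [m.start() for m in pattern.finditer(gene)]

# Hypothetical sample data for illustration.
sites = find_primer_sites("AA", "ANATAAR")
print(sites)
```

Without the lookahead, finditer() would skip any site that overlaps the previous match.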


--
http://mail.python.org/mailman/listinfo/python-list
