On Oct 5, 3:38 pm, MRAB <pyt...@mrabarnett.plus.com> wrote:
> On 05/10/2010 20:03, chaoticcran...@gmail.com wrote:
> > So, I have a rather tricky string comparison problem: I want to search
> > for a set pattern in a variable source.
> >
> > To give you the context, I am searching for set primer sequences
> > within a variable gene sequence. In addition to the non-degenerate
> > A/G/C/T, the gene sequence could have degenerate bases that could
> > encode for more than one base (for example, R means A or G, N means
> > A or G or C or T). One brute force way to do it would be to generate
> > every single non-degenerate sequence the degenerate sequence could
> > mean and do my comparison with all of those, but that would of course
> > be very space and time inefficient.
> >
> > For the sake of simplicity, let's say I replace each degenerate base
> > with a single wildcard character "?". We can do this because there are
> > so many more non-degenerate bases that the probability of a degenerate
> > mismatch is low if the nondegenerates in a primer match up.
> >
> > So, my goal is to search for a small, set pattern (the primer) inside
> > a large source with single wildcard characters (my degenerate gene).
> >
> > The first thing that comes to my mind are regular expressions, but I'm
> > rather n00bish when it comes to using them and I've only been able to
> > find help online where the smaller search pattern has wildcards and
> > the source is constant, such as here:
> > http://www.velocityreviews.com/forums/t337057-efficient-string-lookup...
> >
> > Of course, that's the reverse of my situation and the proposed
> > solutions there won't work for me. So, could you help me out, oh great
> > Python masters? *bows*
>
> Stand back, I'm going to try regex. :-)
>
> Both "A" and "R" in the variable sequence should match "A" in the
> primer sequence, so "A" in the primer sequence should be replaced by
> the character set "[AR]". The other bases should be replaced similarly.
>
> Use a simple dict lookup:
>
> wildcards = {"A": "[ARN]", "G": "[GRN]", "C": "[CN]", "T": "[TN]"}
>
> and create the regex for the primer sequence:
>
> primer_pattern = re.compile("".join(wildcards[c] for c in primer))
>
> Would that work?
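For reference, the suggestion quoted above can be put together as a minimal runnable sketch. The helper name build_primer_pattern and the sample primer and gene strings are made up for illustration and are not from the thread. The lookup table only covers the degenerate codes R and N, exactly as quoted.

```python
import re

# Lookup table from the quoted suggestion: each primer base maps to the
# set of gene characters that may stand for it (only R and N are covered).
wildcards = {"A": "[ARN]", "G": "[GRN]", "C": "[CN]", "T": "[TN]"}


def build_primer_pattern(primer):
    """Compile a regex that matches the primer against a degenerate gene.

    Hypothetical helper name; a KeyError is raised for any primer
    character that is not A, G, C or T.
    """
    return re.compile("".join(wildcards[c] for c in primer))


# Made-up example data, assumed for illustration only.
primer = "GATC"
gene = "TTGRNCAA"

pattern = build_primer_pattern(primer)
print(pattern.pattern)             # [GRN][ARN][TN][CN]
print(bool(pattern.search(gene)))  # True: "GRNC" can stand for "GATC"
```

Each character class consumes exactly one character of the gene, so a match always has the same length as the primer.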
Thank you for your response, MRAB. That's a rather clever way to do this sort of matching, but I actually forgot one other crucial thing in my problem description (and I'm hitting myself on the head for forgetting it!): I need to know at what position in my gene the primer was found.

As far as I know (and I'm a regex n00b, so please tell me if I'm wrong), you can't use a string's find() with a regex, and a regex's match() does not return a position in the regex. I understand that some elements of regular expressions expand to variable numbers of characters, so a "position number" in a regular expression is often a meaningless concept. Here, however, my regular expression has a 1-to-1 correspondence, since each degenerate base should occupy only one wildcard slot. In this particular case, a position number is meaningful AND I need to know it for my program.

Now... is there anything we can do about that?
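One possible way to get the position, offered as a sketch and not as the thread's final answer: the match objects returned by re.search() and re.finditer() have start() and end() methods that give 0-based offsets into the searched string. The function name find_primer_positions and the sample gene below are hypothetical.

```python
import re

# Same lookup table as in the quoted suggestion.
wildcards = {"A": "[ARN]", "G": "[GRN]", "C": "[CN]", "T": "[TN]"}


def find_primer_positions(primer, gene):
    """Return the 0-based start index of every non-overlapping match.

    Hypothetical helper; finditer() resumes scanning after the end of
    each match, so overlapping occurrences are not reported.
    """
    pattern = re.compile("".join(wildcards[c] for c in primer))
    return [m.start() for m in pattern.finditer(gene)]


# Made-up example data, assumed for illustration only.
gene = "TTGRNCAAGATC"
positions = find_primer_positions("GATC", gene)
print(positions)  # [2, 8]
```

If only the first hit matters, calling search() on the compiled pattern and then start() on the result gives the same first index, after checking that the result is not None.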