Let's try this one more time...  I originally sent this reply last Tuesday,
but it seems a problem with the message headers kept it from being
delivered.


On Tue, Mar 02, 2004 at 08:01:07AM -0500, Aaron J. Mackey wrote:
> 
> On the BioPerl mailing list we often get requests like the following:
> 
> Within a given biosequence with length X, find substrings of min. 
> length A and max. length B that contain the pattern P at least C times 
> but no more than D times.
> 
> A more concrete example: Find all substrings 12 characters long (A = B 
> = 12) that have at least 7 (C = 7, D = 12 implictly) 'I' or 'L' 
> characters (P = [IL]) in it.
> 
> The naive approach is a "sliding window" method, but it seems to me 
> that a pattern matching approach would be more efficient.  And it 
> sounds like a great little challenge for the brilliant minds of FWP.  
> The "best" version will find it's way into a BioPerl module (with 
> appropriate attribution, of course).  Golfing is not the goal here (but 
> Golf-ed solutions are still welcome, if you must).

I think I've got almost what you want.  It correctly limits the length and
number of repetitions (example 1) and handles overlapping occurrences of
the pattern (example 2).

The one flaw is that it will only match one possible substring starting at
any point in the string.  For example, if you have S = ABAB, P = [AB], A =
2, B = 4, C = 2, D = 4, you could match ABAB, ABA, and AB all at the start
of the string, but my regex will only match ABAB (example 3).  Perl won't
match a 0-length substring more than once at the same position.  I'm not
aware of a simple way around that within the regex.


#!/usr/local/bin/perl -w

use strict;
use re 'eval';

my @args = (
            [qw/ ABCBAB [AB] 3 3 2 2 /],
            [qw/ 12121212 1212 8 8 3 3 /],
            [qw/ ABAB [AB] 2 4 2 4 /],
           );

foreach my $args (@args) {
  print "@$args\n";
  my @matches = subpattern(@$args);
  foreach my $match (@matches) {
    print "@$match ",
          substr($args->[0], $match->[0], $match->[1] - $match->[0]),
          "\n";
  }
  print "\n";
}


sub subpattern {
  my($string, $pattern, $minlen, $maxlen, $minrep, $maxrep) = @_;

  my $not_pattern = "(?:(?!$pattern).)";
  my $match_length =
    "(??{(length\$1>=$minlen&&length\$1<=$maxlen)?'(?=)':'(?!)'})";

  my @matches;

  my $re = "(?=($not_pattern*" .
                "((?=$pattern).$not_pattern*){$minrep,$maxrep}" .
           ")$match_length)";

  warn "$re\n";

  while ($string =~ /$re/g) {
    push @matches, [$-[0], $-[0] + length $1];
  }

  return @matches;
}

__END__


Reply via email to