Re: Multiple matching of a group of characters

Brandon McCaig Mon, 01 Oct 2012 17:37:34 -0700

On Mon, Oct 01, 2012 at 11:15:53PM +0100, Florian Huber wrote:
> Dear all,


Hello,

> $string = "/NOTNEEDED/*ACGACGGGTTCAAGGCAG*/NOTNEEDED/"

I would suggest that you show us the real data. I'm assuming that
'NOTNEEDED' is a placeholder for some data that you're not
interested in. Without knowing what that is we can't really say
for sure what is going on (though we can speculate; see below).

Note that you should be using the strict and warnings pragmas
(see below). The lack of 'my' here suggests that you probably
aren't.

> But when I do
> 
> $string =~ /[ACGT]/;
> 
> it matches only the last letter, i.e. "G". Why doesn't it start
> at the beginning?

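It isn't matching the last letter. An unanchored character class
matches the first suitable character it finds, scanning from the
left, so you are probably making a wrong assumption somewhere.
This is common when you're having trouble with code. Again, show
us the 'NOTNEEDED' part. :)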

> But it gets even better, I figured that adding the greedy *
> should help:
> 
> $string =~ /[ACGT]*/;
> 
> and now it doesn't match anything. Shouldn't it try to match as
> many times as possible?

It does match (assuming the same data), just not what you expect.
'*' means zero or more, so the pattern is allowed to match
without consuming any characters at all.

> My confusion was complete when I tried
> 
> $string =~ /[ACGT]{5}/;
> 
> now it matches 5 letters, but this time from the beginning,
> i.e.: ACGAC.

I'm guessing that the first 'NOTNEEDED' contains a 'G'. That
would explain the first match. The second result follows from
the string starting with '/': [ACGT]* matches zero characters
right there, at the very start, and the regex engine takes the
first match it finds, so you get an empty string. If 'NOTNEEDED'
doesn't contain a string at least 5 characters in length composed
only of 'A', 'C', 'G', or 'T' then that would explain this last
result.
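
To check this kind of guess yourself, it helps to print what each
pattern matched and where. The following is only a sketch that
uses the placeholder string from the post (the real data may
behave differently); first_match is a hypothetical helper, not
something from the original code.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Placeholder string from the original post; the real data may differ.
my $string = '/NOTNEEDED/*ACGACGGGTTCAAGGCAG*/NOTNEEDED/';

# Hypothetical helper: return the matched text and its start offset,
# or an empty list if the pattern does not match at all.
sub first_match {
    my ($text, $regex) = @_;
    return unless $text =~ $regex;
    # @- and @+ hold the start and end offsets of the last match.
    my ($start, $end) = ($-[0], $+[0]);
    return (substr($text, $start, $end - $start), $start);
}

for my $regex (qr/[ACGT]/, qr/[ACGT]*/, qr/[ACGT]{5}/) {
    my ($matched, $offset) = first_match($string, $regex);
    print "$regex matched '$matched' at offset $offset\n";
}
```

With the placeholder string, the first pattern finds the 'T' inside
'NOTNEEDED' at offset 3, the second finds an empty string at
offset 0, and the third finds 'ACGAC' at offset 12.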

> I fail to understand that behaviour. I checked the Perl
> documentation a bit and I sort of understand why /[ACGT]/ only
> matches one letter only (but not why it starts at the end).
> However, I'm simply puzzled at the other things.

As said, if your problems persist then post a complete (minimal)
program that demonstrates them.

Assuming 'NOTNEEDED' cannot contain '/' characters then you may
need to include those in your pattern to make sure you match the
parts you want. You will probably want to use captures for that
(see perldoc perlre). To understand the below program you will
also need to understand the /x modifier (again see perldoc
perlre).

#!/usr/bin/perl

use strict;   # <---Make sure you have these.
use warnings; # <--/

my $string = '/NOTNEEDED/*ACGACGGGTTCAAGGCAG*/NOTNEEDED/';

my ($match) = $string =~ m,
        ^         # Beginning of string.
        /         # Skip over the first '/'.
        [^/]*     # Skip over anything that's not a '/'.
        /         # Until the next '/'. Skip over that too.
        \*        # Skip over the literal '*' character.
        ([ACGT]+) # Now capture the sequence we want.
        ,x;

print $match, "\n";

__END__

Output:

ACGACGGGTTCAAGGCAG

IF the '*' characters literally delimit the parts that you want
(AND not the parts that you don't want) then that's even easier:

#!/usr/bin/perl

use strict;
use warnings;

my $string = '/NOTNEEDED/*ACGACGGGTTCAAGGCAG*/NOTNEEDED/';

my ($match) = $string =~ /\*([ACGT]+)/;

print $match, "\n";

__END__

This produces the same output with this sample string. Without
seeing the real data it's hard to speculate. There might be a
better way. You need to know the specifications of the data
you're processing if you want to reliably process it
automatically. We need to know this to help you do it too.
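
As a rough sketch of a slightly stricter variant: if the sequence
is always enclosed by a '*' on both sides, you can require the
closing '*' too, collect every such sequence with /g, and check
whether anything matched before printing. The enclosing-'*'
format is an assumption based on the sample string, and
star_delimited_sequences is a made-up name for illustration.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Hypothetical helper: return every run of A/C/G/T that is enclosed
# in literal '*' characters. Assumes '*' delimits the wanted parts.
sub star_delimited_sequences {
    my ($text) = @_;
    # In list context, /g returns all captures from all matches.
    my @found = $text =~ /\*([ACGT]+)\*/g;
    return @found;
}

my $string = '/NOTNEEDED/*ACGACGGGTTCAAGGCAG*/NOTNEEDED/';

my @sequences = star_delimited_sequences($string);

if (@sequences) {
    print "$_\n" for @sequences;
}
else {
    # A failed match would otherwise leave us printing undef.
    warn "No '*'-delimited sequence found.\n";
}
```

Checking the result matters because a failed match in list context
returns an empty list, which would leave $match undefined in the
programs above.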

                              o o o o

A lot of people seem to post about this same type of data. I'd be
surprised if nobody has written CPAN modules for parsing the data
yet (and if not then perhaps it would be economical to do so).
Just saying...

Regards,


-- 
Brandon McCaig <bamcc...@gmail.com> <bamcc...@castopulence.org>
Castopulence Software <https://www.castopulence.org/>
Blog <http://www.bamccaig.com/>
perl -E '$_=q{V zrna gur orfg jvgu jung V fnl. }.
q{Vg qbrfa'\''g nyjnlf fbhaq gung jnl.};
tr/A-Ma-mN-Zn-z/N-Zn-zA-Ma-m/;say'
