Re: Multiple matching of a group of characters

William Muriithi Tue, 02 Oct 2012 16:18:34 -0700

Florian,
>
> The string is:
>
>>ENSG00000112365|ENST00000230122|109783797|109787053TGTTTCACAATTCATTTTCTACTAAATGTGTACCATTTTTTAAATTGTTTTAACAGAAAGCTGAGGAATGAAAAAACTTCAAGCATTATTTTCAAG


It would have helped if you had posted two or three samples.  That would
let us identify patterns in your data and advise on the regular
expression needed to process it.
>
> So I'm trying to retrieve 'ENSG00000112365', 'ENST00000230122' and the
> sequence bit, starting with a 'T' and get rid of the junk in between.

There are a lot of Ts in the above gene sequence; I'm not sure which one
you refer to when you say "starting with 'T'".

>
> code:
>
> #!/usr/bin/perl
>
> use strict;
> use warnings;
>
> my $gene;
> my @elements = <>;
>
> foreach $gene (@elements) {
>     $gene =~ />(ENSG\d*) \| (ENST\d*) .*? ([AGCT]*)/x;
Try

$gene =~ />(ENSG\d*) \| (ENST\d*) .*? ([AGCT]+) \s* \z/x;

This assumes you need the whole run of A, C, G and T characters up to the
end of the line.
>     print "$1 $2 $3\n";
> }
>
>
> This will print "ENSG00000112365 ENST00000230122"
>
> without the sequence. Originally I had .* before the ([ACGT]) so I figured
> it's greedy and will eat the sequence away. ? makes it nongreedy, doesn't
> it? Still doesn't work.
>
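Greediness isn't about eating characters away; it controls how much a
quantifier tries to match.  A greedy quantifier takes as much as it can,
a non-greedy one as little as it can.  In your pattern both .*? and
([AGCT]*) are happy to match nothing at all, so the match succeeds right
after the ENST number and $3 comes back empty.  Try Google, as there are
better explanations out there.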
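
A minimal sketch of the difference between a lazy .*? followed by * and by +, using your sample header. The sequence is shortened here for readability and the variable names are only for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample header from above; the sequence is shortened for illustration.
my $gene = ">ENSG00000112365|ENST00000230122|109783797|109787053TGTTTCACAATT\n";

# Lazy .*? followed by [ACGT]*: both are satisfied by zero characters,
# so the match succeeds immediately and the capture is empty.
my ($lazy_star) = $gene =~ / ENST\d+ .*? ([ACGT]*) /x;

# Lazy .*? followed by [ACGT]+: .*? has to keep extending until at least
# one base is available, after which [ACGT]+ greedily takes the whole run.
my ($lazy_plus) = $gene =~ / ENST\d+ .*? ([ACGT]+) /x;

print "with *: '$lazy_star'\n";    # with *: ''
print "with +: '$lazy_plus'\n";    # with +: 'TGTTTCACAATT'
```

So the non-greedy modifier is doing what it should; it is the * on the character class that lets the capture stay empty.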
> Other results:
>
> with ([AGCT])* it says that $3 is uninitialised - so here it didn't match at
> all???
>
> with ([AGCT]{5}) it works fine - it returns TGTTT.
>
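
Both of these results follow from the same rule. A small sketch, again with a shortened copy of your sample header and made-up variable names:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Shortened copy of the sample header, for illustration only.
my $gene = ">ENSG00000112365|ENST00000230122|109783797|109787053TGTTTCACAATT\n";

# ([ACGT])* puts the quantifier outside the group.  The group may match
# zero times, and when it does it never captures anything, so the
# capture is undef (hence the "uninitialized" warning when printed).
my ($outside) = $gene =~ / ENST\d+ .*? ([ACGT])* /x;

# ([ACGT]{5}) cannot match zero characters, so the lazy .*? is forced
# to extend until five consecutive bases are available.
my ($five) = $gene =~ / ENST\d+ .*? ([ACGT]{5}) /x;

print defined $outside ? "outside: '$outside'\n" : "outside: undef\n";
print "five: '$five'\n";    # five: 'TGTTT'
```

In other words, the {5} version only works because it rules out the empty match.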
>
> This I found kinda strange - looks like I've got something with the
> greediness/precedence wrong?
>
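
One way to avoid relying on greediness at all is to spell out every field of the header. This is only a sketch: it assumes each record is a single line of the form >gene|transcript|start|end followed directly by the sequence, as in your one sample, and the sub name parse_record is made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns (gene ID, transcript ID, sequence),
# or an empty list if the line does not have the assumed layout.
sub parse_record {
    my ($line) = @_;
    return $line =~ m{
        ^ >
        (ENSG\d+) \|    # gene ID
        (ENST\d+) \|    # transcript ID
        \d+ \|          # start coordinate, skipped
        \d+             # end coordinate, skipped
        ([ACGT]+)       # the sequence itself
        \s* \z          # optional trailing newline, then end of string
    }x;
}

# Shortened copy of the sample header, for illustration only.
my $sample = ">ENSG00000112365|ENST00000230122|109783797|109787053TGTTTCACAATT\n";

if (my ($gene_id, $transcript_id, $sequence) = parse_record($sample)) {
    print "$gene_id $transcript_id $sequence\n";
}
else {
    warn "Unrecognised record: $sample";
}
```

If your real data wraps the sequence over several lines, or uses other characters such as N, the pattern would need adjusting.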
>
> Thank you for your help!
>
> Flo
>
>
> On 02/10/2012 01:36, Brandon McCaig wrote:
>>
>> On Mon, Oct 01, 2012 at 11:15:53PM +0100, Florian Huber wrote:
>>>
>>> Dear all,
>>
>> Hello,
>>
>>> $string = "/NOTNEEDED/*ACGACGGGTTCAAGGCAG*/NOTNEEDED/"
>>
>> I would suggest that you show us the real data. I'm assuming that
>> 'NOTNEEDED' is a placeholder for some data that you're not
>> interested in. Without knowing what that is we can't really say
>> for sure what is going on (though we can speculate; see below).
>>
>> Note that you should be using the strict and warnings pragmas
>> (see below). The lack of 'my' here suggests that you probably
>> aren't.
>>
>>> But when I do
>>>
>>> $string =~ /[ACGT]/;
>>>
>>> it matches only the last letter, i.e. "G". Why doesn't it start
>>> at the beginning?
>>
>> It isn't matching the last letter. You are probably making the
>> wrong assumption. This is common when you're having trouble with
>> code. Again, show us the 'NOTNEEDED' part. :)
>>
>>> But it gets even better, I figured that adding the greedy *
>>> should help:
>>>
>>> $string =~ /[ACGT]*/;
>>>
>>> and now it doesn't match anything. Shouldn't it try to match as
>>> many times as possible?
>>
>> It should match at least the once that you saw earlier (assuming
>> the same data).
>>
>>> My confusion was complete when I tried
>>>
>>> $string =~ /[ACGT]{5}/;
>>>
>>> now it matches 5 letters, but this time from the beginning,
>>> i.e.: ACGAC.
>>
>> I'm guessing that the first 'NOTNEEDED' contains a 'G'. That
>> would explain the first match. The second result is nonsense
>> with the data we've seen. :-/ If 'NOTNEEDED' doesn't contain a
>> string at least 5 characters in length composed only of 'A', 'C',
>> 'G', or 'T' then that would explain this last result.
>>
>>> I fail to understand that behaviour. I checked the Perl
>>> documentation a bit and I sort of understand why /[ACGT]/ only
>>> matches one letter only (but not why it starts at the end).
>>> However, I'm simply puzzled at the other things.
>>
>> As said, provide us with a full (minimal) program to demonstrate
>> the problems you're having if your problems persist.
>>
>> Assuming 'NOTNEEDED' cannot contain '/' characters then you may
>> need to include those in your pattern to make sure you match the
>> parts you want. You will probably want to use captures for that
>> (see perldoc perlre). To understand the below program you will
>> also need to understand the /x modifier (again see perldoc
>> perlre).
>>
>> #!/usr/bin/perl
>>
>> use strict;   # <---Make sure you have these.
>> use warnings; # <--/
>>
>> my $string = '/NOTNEEDED/*ACGACGGGTTCAAGGCAG*/NOTNEEDED/';
>>
>> my ($match) = $string =~ m,
>>          ^         # Beginning of string.
>>          /         # Skip over the first '/'.
>>          [^/]*     # Skip over anything that's not a '/'.
>>          /         # Until the next '/'. Skip over that too.
>>          \*        # Skip over the literal '*' character.
>>          ([ACGT]+) # Now capture the sequence we want.
>>          ,x;
>>
>> print $match, "\n";
>>
>> __END__
>>
>> Output:
>>
>> ACGACGGGTTCAAGGCAG
>>
>> IF the '*' characters literally delimit the parts that you want
>> (AND not the parts that you don't want) then that's even easier:
>>
>> #!/usr/bin/perl
>>
>> use strict;
>> use warnings;
>>
>> my $string = '/NOTNEEDED/*ACGACGGGTTCAAGGCAG*/NOTNEEDED/';
>>
>> my ($match) = $string =~ /\*([ACGT]+)/;
>>
>> print $match, "\n";
>>
>> __END__
>>
>> This produces the same output with this sample string. Without
>> seeing the real data it's hard to speculate. There might be a
>> better way. You need to know the specifications of the data
>> you're processing if you want to reliably process it
>> automatically. We need to know this to help you do it too.
>>
>>                                o o o o
>>
>> A lot of people seem to post about this same type of data. I'd be
>> surprised if nobody has written CPAN modules for parsing the data
>> yet (and if not then perhaps it would be economical to do so).
>> Just saying...
>>
>> Regards,
>>
>>
>
