Re: Perl, pattern matching, substitution

Rob Dixon Sat, 13 Nov 2010 13:50:39 -0800

On 13/11/2010 18:42, Zachary Brooks wrote:

Hello,


I'm taking a PhD course that requires the use of Perl and pattern matching.
I've taken on the motto "divide and conquer," but it hasn't quite worked. I
appreciate anyone's help.

The task is to extract sentences from a relatively large text file (928K,
ca. 300 pages). But of course, the text file is messy. I've tried two
approaches.

1. My first approach was to use substitute to get rid of a range of things
between <DOC> and </DATELINE>. A short version looks like this.

$hello = "<DOC>  man at the bar order the</DATELINE>";
$hello =~ s/<DOC>.*<\/DATELINE>//gi;
print "$hello\n";

This works until the code comes across a quotation mark ("). So then I
replace double quotation marks (") with single quotation marks ('). But then
as I put more text under $hello, the code seems to break. For example,
running the substitution against the code below simply re-shows me the code.

<DOC>  <DOCNO>  WSJ890728-0079</DOCNO>
<DD>  = 890728</DD>
<AN>  890728-0079.</AN>
<HL>  Major Deficit
@  Signaled by Sun</DATELINE>


It doesn't remove everything between <DOC> and </DATELINE>.


2. My second approach was to simply find what I wanted using matching and
ignoring deleting. A short version looks like this.

$mystring = "<TEXT>     Sun Microsystems Inc. said it will post a
larger-than-expected fourth-quarter loss of as much as $26 million and may
show a loss in the current first quarter, raising further troubling
questions about the once high-flying computer workstation maker.</TEXT>
.";

if($mystring =~ m/<TEXT>(.*?)<\/TEXT>/) {
  print $1;
}


This works. Again I change the double quotation marks for the single
quotation marks. But once again when I include more data with line breaks,
the code breaks. This is the first part of a 5-part question. Very
frustrating. Every university should have Perl tutors just as they have (or
should have) language tutors.


Hi Zach

First of all, it seems that you are processing XML files. The best way by far to achieve this is to use one of the Perl XML libraries such as XML::LibXML or XML::Twig. However, if the changes you are making are minimal and simple then it is possible that an approach using regex substitutions is valid.

Also note that if you delete everything between and including a <DOC> tag and a </DATELINE> tag then what you would be left with is not valid XML.

The code you have shown is failing because /./ matches all characters except "\n". Employing the /s modifier changes this behaviour to match any character at all, so


  $hello =~ s/<DOC>.*?<\/DATELINE>//gis;

and

  $mystring =~ m/<TEXT>(.*?)<\/TEXT>/s;

will have the effect you intend.
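
As a rough illustration of the difference, here is a minimal sketch that runs the substitution with and without /s on the multi-line sample from the question. The <TEXT> line is an abbreviated stand-in for the real data, and the variable names $without_s, $with_s and $text are only for this example. Note that the pattern uses the non-greedy .*? so that, with /s enabled, the match stops at the first </DATELINE> rather than the last one in the file.

```perl
use strict;
use warnings;

# Sample data from the question (the <TEXT> line is abbreviated).
# q{} is used so that '@' and '"' need no escaping.
my $hello = q{<DOC>  <DOCNO>  WSJ890728-0079</DOCNO>
<DD>  = 890728</DD>
<AN>  890728-0079.</AN>
<HL>  Major Deficit
@  Signaled by Sun</DATELINE>
<TEXT> Sun Microsystems Inc. said it will post a loss.</TEXT>};

# Without /s, '.' cannot cross a newline, so the pattern never matches
# and the copy is left unchanged.
(my $without_s = $hello) =~ s/<DOC>.*?<\/DATELINE>//gi;

# With /s, '.' also matches "\n", so the whole header is removed.
(my $with_s = $hello) =~ s/<DOC>.*?<\/DATELINE>//gis;

# Capture the body of the <TEXT> element from what is left.
my ($text) = $with_s =~ m/<TEXT>(.*?)<\/TEXT>/s;

print "unchanged without /s\n" if $without_s eq $hello;
print "text:$text\n";
```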

I think your second approach is more likely to provide the better solution, but it is difficult to judge without knowing more about your data and the requirement.

Remember that, once you are reading data from real XML files, your problem with double quotes will disappear: it applies only when you are putting string data into your program for test purposes. Also, remember the q() and qq() operators to avoid problems with the string delimiter appearing within the string. Look for "Quote and Quote-like Operators" in


  perldoc perlop
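
A small sketch of how q() and qq() might be used for test strings follows. The sample sentence is invented for illustration (loosely based on the text in the question), and $tag, $body and $report are hypothetical names. One extra thing q() helps with here: inside ordinary double quotes, "$26" would be treated as a variable and interpolated, whereas q() leaves it literal.

```perl
use strict;
use warnings;

# q() behaves like single quotes: nothing is interpolated, so the
# embedded double quotes and the literal '$26' are kept as written.
my $mystring = q(<TEXT>Sun said, "a loss of as much as $26 million".</TEXT>);

# qq() behaves like double quotes: variables are interpolated, but a
# '"' inside the string does not need escaping.
my $tag = 'TEXT';
my ($body) = $mystring =~ m/<$tag>(.*?)<\/$tag>/s;
my $report = qq(Found "$body" inside <$tag>\n);

print $report;
```

Any matching pair of brackets can serve as the delimiter (q{}, q[], q<>), which is handy when the string itself contains parentheses.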

HTH,

- Rob


