On 13/11/2010 18:42, Zachary Brooks wrote:
Hello,

I'm taking a PhD course that requires the use of Perl and pattern matching.
I've taken on the motto "divide and conquer," but it hasn't quite worked. I
appreciate anyone's help.

The task is to extract sentences from a relatively large text file (928K,
ca. 300 pages). But of course, the text file is messy. I've tried two
approaches.

1. My first approach was to use substitute to get rid of a range of things
between <DOC> and </DATELINE>. A short version looks like this.

$hello = "<DOC>  man at the bar order the</DATELINE>";
$hello =~ s/<DOC>.*<\/DATELINE>//gi;
print "$hello\n";

This works until the code comes across a quotation mark ("). So then I
replace double quotation marks (") with single quotation marks ('). But then
as I put more text into $hello, the code seems to break. For example,
running the substitution against the code below simply re-shows me the code.

<DOC>  <DOCNO>  WSJ890728-0079</DOCNO>
<DD>  = 890728</DD>
<AN>  890728-0079.</AN>
<HL>  Major Deficit
@  Signaled by Sun</DATELINE>


It doesn't remove everything between <DOC> and </DATELINE>.


2. My second approach was to simply find what I wanted using matching and
skip the deleting. A short version looks like this.

$mystring = "<TEXT>     Sun Microsystems Inc. said it will post a
larger-than-expected fourth-quarter loss of as much as $26 million and may
show a loss in the current first quarter, raising further troubling
questions about the once high-flying computer workstation maker.</TEXT>
.";

if($mystring =~ m/<TEXT>(.*?)<\/TEXT>/) {
  print $1;
}


This works. Again I change the double quotation marks for the single
quotation marks. But once again when I include more data with line breaks,
the code breaks. This is the first part of a 5-part question. Very
frustrating. Every university should have Perl tutors just as they have (or
should have) language tutors.

Hi Zach

First of all, it seems that you are processing XML files. By far the best way to do this is to use one of the Perl XML libraries, such as XML::LibXML or XML::Twig. However, if the changes you are making are minimal and simple, then an approach using regex substitutions may be valid.

Also note that if you delete everything between and including a <DOC> tag and a </DATELINE> tag then what you would be left with is not valid XML.

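The code you have shown is failing because . in a regex matches any character except "\n". The /s modifier changes this behaviour so that . matches any character at all. The substitution should also use the non-greedy .*? so that, in a file holding many documents, each match stops at the nearest </DATELINE> instead of running on to the last one. So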

  $hello =~ s/<DOC>.*?<\/DATELINE>//gis;

and

  $mystring =~ m/<TEXT>(.*?)<\/TEXT>/s;

will have the effect you intend.
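
To make the difference concrete, here is a minimal sketch that runs the substitution with and without /s. The header lines are the ones from your message; the <TEXT> line is a shortened stand-in, and the variable names are only illustrative.

```perl
use strict;
use warnings;

# Sample data from the question. The single-quoted heredoc keeps
# "@" and any quotation marks literal.
my $doc = <<'END';
<DOC>  <DOCNO>  WSJ890728-0079</DOCNO>
<DD>  = 890728</DD>
<AN>  890728-0079.</AN>
<HL>  Major Deficit
@  Signaled by Sun</DATELINE>
<TEXT>  Sun Microsystems Inc. said it will post a loss.</TEXT>
END

# Without /s, "." stops at each newline, so nothing is removed.
(my $without_s = $doc) =~ s/<DOC>.*?<\/DATELINE>//gi;

# With /s, "." also matches "\n", so the whole header is removed.
(my $with_s = $doc) =~ s/<DOC>.*?<\/DATELINE>//gis;

print "unchanged without /s\n" if $without_s eq $doc;
print $with_s;
```

The copy made without /s is identical to the input, which is the "simply re-shows me the code" behaviour you described.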

I think your second approach is more likely to provide the better solution, but it is difficult to judge without knowing more about your data and the requirement.
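
As a rough sketch of how that second approach might scale to a whole file, the following reads the input in one go and collects every <TEXT> element with a global match. The two-document sample and the in-memory filehandle are assumptions standing in for your real file, and the whitespace clean-up is optional.

```perl
use strict;
use warnings;

# Hypothetical two-document sample; a real run would open the corpus file.
my $sample = <<'END';
<DOC>
<TEXT>  First article,
spread over two lines.</TEXT>
</DOC>
<DOC>
<TEXT>  Second article.</TEXT>
</DOC>
END

# Read the whole input at once: matching line by line could never
# see a <TEXT> element that spans several lines.
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $data = do { local $/; <$fh> };
close $fh;

my @texts;
while ($data =~ m{<TEXT>(.*?)</TEXT>}gs) {
    my $text = $1;
    $text =~ s/\s+/ /g;     # collapse line breaks and runs of spaces
    $text =~ s/^ | $//g;    # trim a leading or trailing space
    push @texts, $text;
}

print "$_\n" for @texts;
```

Using m{...} as the delimiter avoids having to escape the slash in </TEXT>.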

Remember that, once you are reading data from real XML files, your problem with double quotes will disappear: it applies only when you are putting string data into your program for test purposes. Also, remember the q() and qq() operators to avoid problems with the string delimiter appearing within the string. Look for "Quote and Quote-like Operators" in

  perldoc perlop
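
A short sketch of the two operators, using a made-up test sentence: q() behaves like single quotes and qq() like double quotes, and neither needs the embedded " escaped. Note that q() also leaves $26 alone, whereas inside ordinary double quotes Perl would try to interpolate it as a variable.

```perl
use strict;
use warnings;

# q(): no interpolation, and " needs no escaping.
my $raw = q(<TEXT>He said "a loss of $26 million" is likely.</TEXT>);

# qq(): variables interpolate, and " is still an ordinary character.
my $amount = 26;
my $cooked = qq(<TEXT>He said "a loss of \$$amount million" is likely.</TEXT>);

print "$raw\n";
print "same string\n" if $raw eq $cooked;
```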

HTH,

- Rob

