On 13/11/2010 18:42, Zachary Brooks wrote:
Hello,
I'm taking a PhD course that requires the use of Perl and pattern matching.
I've taken on the motto "divide and conquer," but it hasn't quite worked. I
appreciate anyone's help.
The task is to extract sentences from a relatively large text file (928K,
ca. 300 pages). But of course, the text file is messy. I've tried two
approaches.
1. My first approach was to use a substitution to remove everything
between <DOC> and </DATELINE>. A short version looks like this.
$hello = "<DOC> man at the bar order the</DATELINE>";
$hello =~ s/<DOC>.*<\/DATELINE>//gi;
print "$hello\n";
This works until the code comes across a quotation mark ("). So then I
replace double quotation marks (") with single quotation marks ('). But then,
as I put more text into $hello, the code seems to break. For example,
running the substitution against the text below simply prints the text back
unchanged.
<DOC> <DOCNO> WSJ890728-0079</DOCNO>
<DD> = 890728</DD>
<AN> 890728-0079.</AN>
<HL> Major Deficit
@ Signaled by Sun</DATELINE>
It doesn't remove everything between <DOC> and </DATELINE>.
2. My second approach was to simply find what I wanted using matching,
without deleting anything. A short version looks like this.
$mystring = "<TEXT> Sun Microsystems Inc. said it will post a
larger-than-expected fourth-quarter loss of as much as $26 million and may
show a loss in the current first quarter, raising further troubling
questions about the once high-flying computer workstation maker.</TEXT>
.";
if($mystring =~ m/<TEXT>(.*?)<\/TEXT>/) {
print $1;
}
This works. Again I change the double quotation marks for the single
quotation marks. But once again when I include more data with line breaks,
the code breaks. This is the first part of a 5-part question. Very
frustrating. Every university should have Perl tutors just as they have (or
should have) language tutors.
Hi Zach
First of all, it seems that you are processing XML files. The best way
by far to achieve this is to use one of the Perl XML libraries such as
XML::LibXML or XML::Twig. However, if the changes you are making are
minimal and simple then it is possible that an approach using regex
substitutions is valid.
Also note that if you delete everything between and including a <DOC>
tag and a </DATELINE> tag then what you would be left with is not valid XML.
The code you have shown is failing because /./ matches all characters
except "\n". Employing the /s modifier changes this behaviour to match
any character at all, so
$hello =~ s/<DOC>.*?<\/DATELINE>//gis;
and
$mystring =~ m/<TEXT>(.*?)<\/TEXT>/s;
will have the effect you intend.
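As a minimal sketch of the difference, the script below runs the same
substitution with and without /s on the header lines from your message.
The <TEXT> element is an abbreviated version of your sample, and the
variable names ($record, $without_s, $with_s, $text) are only
illustrative.

```perl
use strict;
use warnings;

# Header lines copied from the question, followed by a shortened <TEXT>
# element. A single-quoted heredoc keeps '@' and any quotes literal.
my $record = <<'END';
<DOC> <DOCNO> WSJ890728-0079</DOCNO>
<DD> = 890728</DD>
<AN> 890728-0079.</AN>
<HL> Major Deficit
@ Signaled by Sun</DATELINE>
<TEXT> Sun Microsystems Inc. said it will post a
larger-than-expected fourth-quarter loss.</TEXT>
END

# Without /s, '.' cannot cross a newline, so the pattern never matches
# and the copy is left unchanged.
(my $without_s = $record) =~ s/<DOC>.*?<\/DATELINE>//gi;

# With /s, '.' also matches "\n", so the whole header block is removed.
(my $with_s = $record) =~ s/<DOC>.*?<\/DATELINE>//gis;

# Capture the contents of the <TEXT> element across the line break.
my ($text) = $record =~ m/<TEXT>(.*?)<\/TEXT>/s;

print "No change without /s\n" if $without_s eq $record;
print $with_s;
print "Captured:$text\n" if defined $text;
```

The non-greedy .*? matters once a file holds several records: a greedy
.* combined with /s would run from the first <DOC> to the last
</DATELINE> in the string.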
I think your second approach is more likely to provide the better
solution, but it is difficult to judge without knowing more about your
data and the requirement.
Remember that, once you are reading data from real XML files, your
problem with double quotes will disappear: it applies only when you are
putting string data into your program for test purposes. Also, remember
the q() and qq() operators to avoid problems with the string delimiter
appearing within the string. Look for "Quote and Quote-like Operators" in
perldoc perlop
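For instance, here is a small sketch of both operators. The test string
is a made-up, shortened variant of your sample with double quotes added,
and $tag and $note are hypothetical names used only for illustration.

```perl
use strict;
use warnings;

# q() behaves like single quotes: nothing is interpolated, so the double
# quotes and the "$26" stay exactly as typed.
my $mystring = q(<TEXT> Sun said it will post a "larger-than-expected" loss of as much as $26 million.</TEXT>);

# qq() behaves like double quotes: variables are interpolated, but a
# literal double quote inside the string needs no escaping.
my $tag  = 'TEXT';
my $note = qq(Looking for "<$tag>" elements);

print "$note\n";
if ($mystring =~ m/<TEXT>(.*?)<\/TEXT>/s) {
    print "$1\n";
}
```

Any matching pair of delimiters works, for example q{...} or qq[...],
which helps when the text itself contains parentheses.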
HTH,
- Rob