Re: regular expressions and newline characters

Michel Rodriguez Wed, 15 Aug 2001 04:35:24 -0700
On Wednesday 15 August 2001 13:20, Joe Bellifont wrote:
> I have a file that looks like this
> ====
>
> <FNAME>joe</FNAME>
> <SURNAME>bloggs</BLOGGS>
> <QDETAILS>  herein lies the question posed by the user
> the question can be multi-lined
> like this one.
> </QDETAILS>
>
> ======
>
> I'm trying to read the various tag content into variables:
> ==========
> sub ParseFile {
>       my $file = 'submission6.xml';
>
>
>       #opened the file i want to parse
>       open(FH, $file) || die "cannot open file: $!";
>       print "opening $file...........\n\n";
>       #read contents into array
>       my @stuff=<FH>;
>       close(FH);
>
>
>       #create one long string - why I'm not sure - but it worked with the regex
> below
>       foreach my $stuff(@stuff) {
>               $var=$var.$stuff;
>               }
>
>       my @details;
>       # this grabs the text between <FNAME> and </FNAME>
>       ($details[0])=$var=~/\<FNAME\>(.*)\<\/FNAME\>/;
>
>       # this grabs the text between <SURNAME> and </SURNAME>
>       ($details[1])=$var=~/\<SURNAME\>(.*)\<\/SURNAME\>/;
>
>       #I want this top grab all the text between <QDETAILS> and </QDETAILS>
> -newline characters included.
>       ($details[2])=$var=~/\<QDETAILS\>(.*)\<\/QDETAILS\>/;#
>       #PROBLEM IS HERE==================^^^
>       foreach $detail(@details) {
>       print "$detail\n";
>               }
>
>       }
>
> ==========
> the regex for FNAME and SURNAME work fine. But I can't grab the text
> between <QDETAILS> and </QDETAILS> because
> of newline characters I think.
>
> Any other tips on how to improve my code generally?

Hi,

OK, so please bear with me, I am going to sound like an XML ayatollah ;--( 

First you really should not say, or even imply (the file name ending with 
.xml), that you are using XML when you are not: apart from </BLOGGS> instead 
of </SURNAME>, which is obviously a typo, your document is _not_ well-formed 
XML: it lacks a single root element wrapping the list of tags.

Then if you want to process XML, you should never, _never_, _NEVER_! do it 
with regular expressions (OK, maybe there are cases where you can use 
regexps, but they involve huge amounts of data, throw-away conversions and 
generally knowing exactly what you are doing and why). 

Use the parser, Luke!

There are just too many potential traps for you to write a robust XML parser 
with regexps, especially as there is an existing parser (XML::Parser), plus a 
host of XML modules that will make your life much easier.
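
To give an idea of the kind of trap meant here, a minimal sketch using only 
core Perl. The sample string is made up for illustration (it is not the 
original file). It shows two classic surprises: a greedy .* running on to the 
last closing tag, and the dot refusing to match a newline unless the /s 
modifier is given.

```perl
#!/usr/bin/perl -w
use strict;

# made-up sample data, for illustration only
my $var= "<FNAME>joe</FNAME><FNAME>jim</FNAME>\n"
       . "<QDETAILS>line one\nline two\n</QDETAILS>\n";

# trap 1: .* is greedy, so it runs on to the LAST </FNAME> on the line
my( $greedy)= $var=~ m{<FNAME>(.*)</FNAME>};

# trap 2: without /s the dot does not match a newline, so the match fails
my( $no_s)= $var=~ m{<QDETAILS>(.*)</QDETAILS>};

# /s lets the dot match newlines, and .*? stops at the first closing tag
my( $with_s)= $var=~ m{<QDETAILS>(.*?)</QDETAILS>}s;

print "greedy:  $greedy\n";
print "no /s:   ", ( defined $no_s ? $no_s : "(no match)"), "\n";
print "with /s: $with_s";
```

Even the last regexp only works as long as the data stays this simple: 
attributes on the tags, comments, CDATA sections or entities such as &amp; 
would all need extra handling, which a parser already does for you.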

In fact if you can install XML::Parser and XML::Simple on your system it will 
be dead easy for you to get the values of the fields in a hash:

#!/bin/perl -w
use strict;

use XML::Simple;                # depends on XML::Parser
use Data::Denter;                # just to check what's read in by XML::Simple

my $data= XMLin( \*DATA); # read the data, you would use "./$file"
print Denter( $data), "\n";     # just checking

# yes, it's that simple!
# the list needs its own parentheses: a bare qw() here is a syntax error in perl 5.18+
foreach my $field (qw( FNAME SURNAME QDETAILS))
  { print "$field: $data->{$field}\n"; }

__DATA__
<doc>
<FNAME>joe</FNAME>
<SURNAME>bloggs</SURNAME>
<QDETAILS>  herein lies the question posed by the user
the question can be multi-lined
like this one.
</QDETAILS>
</doc>
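
Reading from a real file instead of DATA could look like the sketch below. It 
assumes XML::Simple is installed, and it writes the sample document to a 
temporary file (via File::Temp, an addition made only so the example is 
self-contained) where you would use your own file name. XMLin dies when the 
document is not well-formed, so the call is wrapped in eval; the trimming of 
QDETAILS is optional and just one way to drop the leading spaces and the 
trailing newline.

```perl
#!/usr/bin/perl -w
use strict;

use XML::Simple;                # assumes XML::Simple (and XML::Parser) are installed
use File::Temp qw( tempfile);   # core module, only here to make the sketch self-contained

# write the sample document to a temporary file (stands in for your own file)
my( $fh, $file)= tempfile( SUFFIX => '.xml', UNLINK => 1);
print $fh "<doc>\n<FNAME>joe</FNAME>\n<SURNAME>bloggs</SURNAME>\n",
          "<QDETAILS>  herein lies the question\nposed by the user\n</QDETAILS>\n</doc>\n";
close $fh or die "cannot close $file: $!";

# XMLin dies if the document is not well-formed, so trap the error
my $data= eval { XMLin( $file) };
die "cannot parse $file: $@" unless $data;

# the text comes back as is, newlines included; trim both ends if you want to
my $question= $data->{QDETAILS};
$question=~ s{^\s+}{};
$question=~ s{\s+$}{};

print "FNAME: $data->{FNAME}\n";
print "QDETAILS: $question\n";
```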

