On Thu, 29 Mar 2001 00:37:38 -0800, Nathan Torkington wrote:
> Greg McCarroll wrote:
> > sure it makes sense, but it still is CiP and trust me this isn't
> > the only bit of CiP in here and much kudos to Paul for it ;-)
>
> I'm unsure what CiP is, but if it has anything to do with gnarliness,
> I know that Paul wrote a >1k regexp to parse XML correctly. It only
> fails one test from a real XML parsing package, and he tracked that
> to the limitations of the new RE stuff in 5.6.0.
Here's what I use, which probably isn't what most people would think of when they hear
"XML parser", but it does let me extract the bits I generally want. I wouldn't give
this a CiP rating, but then I know how it works.
# xml_parse($xml,$tag)
# Return the contents of any <$tag>s appearing in $xml
# Returns an arrayref of hashrefs of attributes, content in {__content__}
# If $tag eq '*', returns all tags, element names in {__element__}
# This is pretty simple, and assumes the following:
# attributes match \w+
# all attributes have double-quoted values
# there are no <CDATA[[]]> sections
# end tags don't have attributes
# $xml is otherwise well-formed
sub xml_parse{
my($xml,$_tag)=@_;
my $tag=$_tag eq '*' ? '[\w:-]+' : "\Q$_tag\E";
# Remove comments
$xml=~s/<!--.*?-->//gs;
# Extract tags
my @tags;
pos $xml=0; # Reset /g position
while($xml=~m#\G.*?<($tag)\b#gs){
my $tag=$1;
my %tag;
$tag{__element__}=$tag if $_tag eq '*';
while($xml=~m#\G\s+(\w+)="([^"]*)"#gc){
$tag{$1}=$2;
}
if($xml=~m#\G\s*/>#gc){
# There's no content
}else{
# Get the content
$xml=~m#\G\s*>#gc or next; # Next means not well formed
my $level=1;
while(1){
if($xml=~m#\G((?:(?!</?\s*\Q$tag\E\b).)+)#gcs){
# More content
$tag{__content__}.=$1;
}elsif($xml=~m#\G(</\s*\Q$tag\E\s*>)#gc){
# End tag
if(--$level){
$tag{__content__}.=$1;
}else{
last;
}
}elsif($xml=~m#\G(<\s*\Q$tag\E\b([^>]*)>)#gc){
# Start tag
$tag{__content__}.=$1;
++$level unless (my $tmp=$2)=~m#/\s*$#;
}else{
# We must have reached the end of the string, so it's not well formed
last;
}
}
}
push @tags,\%tag;
}
\@tags;
}
--
Peter Haworth [EMAIL PROTECTED]
"I think there's a problem with the server power supply"
"Why?"
"There were flames coming out of the cooling fan until it stopped."