On Thu, 29 Mar 2001 00:37:38 -0800, Nathan Torkington wrote:
> Greg McCarroll wrote:
> > sure it makes sense, but it still is CiP and trust me this isn't
> > the only bit of CiP in here and much kudos to Paul for it ;-)
> 
> I'm unsure what CiP is, but if it has anything to do with gnarliness,
> I know that Paul wrote a >1k regexp to parse XML correctly.  It only
> fails one test from a real XML parsing package, and he tracked that
> to the limitations of the new RE stuff in 5.6.0.

Here's what I use, which probably isn't what most people would think of when they hear 
"XML parser", but it does let me extract the bits I generally want. I wouldn't give 
this a CiP rating, but then I know how it works.

# xml_parse($xml,$tag)
# Return the contents of any <$tag>s appearing in $xml
# Returns an arrayref of hashrefs of attributes, content in {__content__}
# If $tag eq '*', returns all tags, element names in {__element__}
# This is pretty simple, and assumes the following:
#        attributes match \w+
#        all attributes have double-quoted values
#        there are no <CDATA[[]]> sections
#        end tags don't have attributes
#        $xml is otherwise well-formed
sub xml_parse{
  my($xml,$_tag)=@_;
  my $tag=$_tag eq '*' ? '[\w:-]+' : "\Q$_tag\E";

  # Remove comments
  $xml=~s/<!--.*?-->//gs;

  # Extract tags
  my @tags;
  pos $xml=0; # Reset /g position
  while($xml=~m#\G.*?<($tag)\b#gs){
    my $tag=$1;
    my %tag;
    $tag{__element__}=$tag if $_tag eq '*';

    while($xml=~m#\G\s+(\w+)="([^"]*)"#gc){
      $tag{$1}=$2;
    }
    if($xml=~m#\G\s*/>#gc){
      # There's no content
    }else{
      # Get the content
      $xml=~m#\G\s*>#gc or next; # Next means not well formed
      my $level=1;
      while(1){
        if($xml=~m#\G((?:(?!</?\s*\Q$tag\E\b).)+)#gcs){
          # More content
          $tag{__content__}.=$1;
        }elsif($xml=~m#\G(</\s*\Q$tag\E\s*>)#gc){
          # End tag
          if(--$level){
            $tag{__content__}.=$1;
          }else{
            last;
          }
        }elsif($xml=~m#\G(<\s*\Q$tag\E\b([^>]*)>)#gc){
          # Start tag
          $tag{__content__}.=$1;
          ++$level unless (my $tmp=$2)=~m#/\s*$#;
        }else{
          # We must have reached the end of the string, so it's not well formed
          last;
        }
      }
    }
    push @tags,\%tag;
  }

  \@tags;
}


-- 
        Peter Haworth   [EMAIL PROTECTED]
 "I think there's a problem with the server power supply"
 "Why?"
 "There were flames coming out of the cooling fan until it stopped."

Reply via email to