This script is excellent, but I need it to read the letters "åäö"
and "ÅÄÖ" too, because they are part of my language (Swedish) and they
appear in the articles.
I also need to keep the word between the <HIT></HIT> tags.
Finally, how do I enclose the articles in <ARTICLES></ARTICLES>? Chas is
right: there can be more than one article in the file, and they are not
enclosed by any outer tags.
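
My guess is that all three points could be handled in one step before the
parser runs. This is only a minimal sketch, not tested against the real
feed: wrap_articles is a name I made up, and I am assuming the files arrive
as ISO-8859-1 (Latin-1) and that the <HIT> tags never carry attributes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of Chas's script): turn the raw feed text
# into one well-formed XML document with a single root tag.
sub wrap_articles {
    my ($raw) = @_;

    # Drop the <HIT> and </HIT> tags but keep the word between them,
    # so the text of INL and BRO becomes plain character data.
    $raw =~ s{</?HIT>}{}g;

    # Assumption: the feed is ISO-8859-1. Declaring the encoding lets the
    # parser accept the bytes for å, ä, ö instead of assuming UTF-8.
    return qq{<?xml version="1.0" encoding="ISO-8859-1"?>\n}
         . "<ARTICLES>\n$raw\n</ARTICLES>\n";
}

# Usage sketch: wrap the file named on the command line and print the
# result to STDOUT.
if (@ARGV) {
    open my $in, '<', $ARGV[0] or die "Could not open $ARGV[0]: $!";
    binmode $in;                        # read the feed as raw bytes
    my $raw = do { local $/; <$in> };   # slurp the whole file
    close $in;

    binmode STDOUT;
    print wrap_articles($raw);
}
```

If that is right, the returned string could go to $parser->parse(...) in
place of $parser->parsefile($ARGV[0]), so no temporary file is needed.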



Thanks for good support and help.




Chas Owens wrote:
> 
> Please, please, please, do not try to parse XML with regexps.  They only
> work in the simplest cases.  There are perfectly good XML modules
> designed to parse XML for you and they are not that hard to use.
> 
> The following code parses an XML file similar to the one you described,
> but has an additional tag (<ARTICLES></ARTICLES>) since XML must have
> one and only one root tag.  I added this tag because I thought you have
> more than one article per file.  If this is true then the XML you
> described is not well formed.  However it would be a simple process to
> wrap this tag around the file before attempting to parse it.  If there
> is in fact only one article per file then remove the outer foreach and
> replace $articles->children with $xmlobj->children.
> 
> <code>
> #!/usr/bin/perl -w
> 
> use strict;
> use XML::Parser;       #parse XML into an internal format
> use XML::SimpleObject; #easy to use frontend to XML::Parser
> 
> if (@ARGV != 2) { die "Usage: $0 news.xml index.html" }
> 
> my $parser = new XML::Parser (ErrorContext => 2, Style => "Tree");
> my $xmlobj = new XML::SimpleObject ($parser->parsefile($ARGV[0]));
> 
> open HTML, ">$ARGV[1]" or die "Could not open $ARGV[1]:$!";
> select HTML;
> 
> print "
> <html>
> <head>
> <title>
> News Articles for " . localtime() .  "
> </title>
> </head>
> <body>
> <table>";
> 
> foreach my $articles ($xmlobj->children) { #get the top tag
>    foreach my $article ($articles->children) { #get all articles
>       my $file = $article->child('PUB')->value . '-' .
>                  $article->child('RUB')->value . '-' .
>                  $article->child('LEV')->value . '-' .
>                  $article->child('DAT')->value;
>       $file =~ s/[^\w.-]//g; #remove anything not alphanumeric, _, -, or .
>       open FH, ">$file" or die "Could not open $file:$!";
>       print FH $article->child('BRO')->value;
>       close FH;
>       print
> "<tr><td>", $article->child('ORD')->value, "</td></tr>\n",
> "<tr><td>", $article->child('LEV')->value, "</td></tr>\n",
> "<tr><td>", $article->child('DAT')->value, "</td></tr>\n",
> "<tr><td>", $article->child('PUB')->value, "</td></tr>\n",
> "<tr><td><a href=\"$file\">", $article->child('RUB')->value,
> "</a></td></tr>\n","<tr><td>", $article->child('INL')->value,
> "</td></tr>\n",
> "<tr><td></td></tr>";
>    }
> }
> 
> print "
> </table>
> </body>
> </html>";
> 
> close HTML;
> </code>
> 
> On 19 Jun 2001 13:34:03 +0100, Nigel Wetters wrote:
> > I think I can give you some clues. Here's some code out of the Perl
> > Cookbook (6.8 Extracting a Range of Lines), which I've adapted for you.
> > You should be able to nest such structures to get what you want.
> >
> > my $extracted_lines = '';
> > while (<>) {
> >     if (/BEGIN PATTERN/ .. /END PATTERN/) {
> >         # line falls between BEGIN and END in the
> >         # text, inclusive
> >         $extracted_lines .= $_;
> >     } else {
> >         # now, we're outside the pattern
> >         process($extracted_lines) if $extracted_lines;
> >         $extracted_lines = '';
> >     }
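> > }
> > # flush a range that is still open when the input ends
> > process($extracted_lines) if $extracted_lines;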
> > sub process
> > {
> >     # do stuff with the extracted lines,
> >     # maybe applying more regexes
> > }
> >
> > >>> Morgan <[EMAIL PROTECTED]> 06/19/01 01:12pm >>>
> > Hi
> >
> > I'm a newbie Perl developer and a rookie at XML :(
> >
> > Is there anyone who can give me some hints or help me out with a problem
> > I have?
> >
> > Here is the problem.
> > I will receive news articles three times a day in XML format and I need to
> > automatically publish those articles on a web page. The first page
> > should only show the tags down to the </INL>
> > tag and a link to the whole article.
> >
> > Here is a sample of the XML format.
> >
> > <ART>
> > <ORD>anbud</ORD>
> > <LEV>2001-06-14</LEV>
> > <DAT>14-06-01</DAT>
> > <PUB>DAGENS INDUSTRI</PUB>
> > <RUB>Dragkamp om förlusttåg</RUB>
> > <INL>Here is the introduction to the article, and when the word
> > anbud comes up it is enclosed in <HIT>anbud</HIT> tags.
> > This is the word we use as the criterion for the articles we should receive.
> > </INL>
> > <BRO>
> > Here comes the rest of the document, that is, the whole article.
> > Even here there is <HIT>anbud</HIT>, and I need the word between <HIT>
> > because it is part of the article.
> > The article ends with
> > </BRO>
> > </ART>
> >
> >
> > Raven
> >
> >
> >
> >
> --
> Today is Setting Orange, the 24th day of Confusion in the YOLD 3167
> Wibble.
