Re: xml problem

Chas Owens Thu, 21 Jun 2001 03:54:11 -0700
On 21 Jun 2001 10:38:08 +0200, Morgan wrote:
> This script is excellent but I need the script to read the letters "åäö"
> and "ÅÄÖ" too.
> Cuz this is part of my language (Swedish) and those letters are in the
> articles.
I am working on this; I don't understand what the parser is doing with
them.  If I add <?xml version='1.0' encoding='ISO-8859-1'?> to the start
of the file the parser doesn't bomb any more, but it outputs "Dragkamp om
fÃ¶rlusttÃ¥g" instead of "Dragkamp om förlusttåg".  My current
assumption is that it is doing something funky involving two-byte
Unicode chars (this is based on the observation that each accented char
comes out as two funky chars).  If anyone could shed some light on this I
would be grateful.  I am at the point where I am going to use Merlyn's
wonderful regexp from elsewhere in this thread to change these
characters into <ftdp ord="the ord value of char" /> and then use a
regexp to change them back before printing to the files.  For the
curious, ftdp stands for "fsck the durn parser" here.
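
The doubled characters are what you would see if the parser hands back
UTF-8 and the result is viewed as ISO-8859-1 (expat, which sits under
XML::Parser, works in UTF-8 internally).  The sketch below reproduces the
effect with the core Encode module and then shows the ftdp round trip.
Merlyn's regexp is not reproduced in this message, so the substitutions
here are my own stand-ins, and the variable names are made up for
illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The ISO-8859-1 bytes as they would sit in the input file.
my $latin1_bytes = "Dragkamp om f\xF6rlustt\xE5g";

# Decode to characters, then encode as UTF-8.
# Each accented letter becomes two bytes: \xF6 -> \xC3\xB6, \xE5 -> \xC3\xA5.
my $chars      = decode('ISO-8859-1', $latin1_bytes);
my $utf8_bytes = encode('UTF-8', $chars);

# Converting back to ISO-8859-1 before printing restores the original bytes.
my $restored = encode('ISO-8859-1', decode('UTF-8', $utf8_bytes));

# The ftdp workaround (stand-in regexps, not Merlyn's): hide every
# high-bit byte from the parser, then put it back afterwards.
(my $escaped = $latin1_bytes)
    =~ s/([\x80-\xFF])/sprintf '<ftdp ord="%d" />', ord $1/ge;
(my $unescaped = $escaped)
    =~ s/<ftdp ord="(\d+)" \/>/chr $1/ge;

print length($latin1_bytes), " bytes in, ", length($utf8_bytes), " bytes as UTF-8\n";
print "$escaped\n";
```

If that guess is right, re-encoding the parser's output to ISO-8859-1 may
be all that is needed, and the ftdp trick becomes a fallback.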

<joke>
Of course, being an American my first thought is

$text =~ tr/ÅÄÖåäö/AAOaao/; #tr takes no /g modifier; I don't use tr often

If I can figure out what resume (which is supposed to have two e acutes
in one case) means (either restarting a process or somebody's work
experience) based on context, you foreigners can too.  All these
squiggles, dots, etc. are just sugar.
</joke>

> And I need to have the word between the <HIT></HIT> tags in too.
The XML file has a tree-like structure that looks like this:

<ARTICLES>
<ART>
<ORD>this is in ord</ORD>
<LEV>this is in lev</LEV>
<DAT>this is in dat</DAT>
<PUB>this is in pub</PUB>
<RUB>this is in rub</RUB>
<INL>
INL text that has a <HIT>hit value</HIT> in it.
</INL>
<BRO>this is in bro</BRO>
</ART>
</ARTICLES>

So we can access <HIT> the same way we access the other tags:

#the first child returns an XML::SimpleObject for the INL tag
#the second child is called from the returning object and returns
#   an XML::SimpleObject for the HIT tag
#and finally the value is called from that returning object and returns
#   the contents of that tag ("hit value" in this case).
$article->child('INL')->child('HIT')->value
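
For context, here is a minimal sketch of that call inside a loop over the
articles.  It assumes XML::Parser and XML::SimpleObject are installed from
CPAN, uses a cut-down copy of the sample document above, and guards
against articles without a <HIT> on the assumption that child() returns a
false value for a missing tag.  The variable names are only for
illustration.

```perl
use strict;
use warnings;
use XML::Parser;
use XML::SimpleObject;

# Cut-down copy of the sample structure.
my $xml = <<'XML';
<ARTICLES>
<ART>
<ORD>this is in ord</ORD>
<INL>
INL text that has a <HIT>hit value</HIT> in it.
</INL>
</ART>
</ARTICLES>
XML

# XML::SimpleObject is built from XML::Parser's Tree-style output.
my $parser = XML::Parser->new(ErrorContext => 2, Style => 'Tree');
my $xmlobj = XML::SimpleObject->new($parser->parse($xml));

my @hits;
foreach my $article ($xmlobj->child('ARTICLES')->children('ART')) {
    # Not every article is guaranteed to have INL or HIT, so check first.
    my $inl = $article->child('INL');
    my $hit = $inl ? $inl->child('HIT') : undef;
    push @hits, $hit->value if $hit;
}

print "$_\n" for @hits;
```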

> Finally, how do I enclose the article with <ARTICLES></ARTICLES>? Cuz
> Chas is right there, there can be more than one article in the file and
> it is not enclosed by any outer tags.
Well, there are several ways.  The best way is to get the entity that is
sending you this file to write well-formed XML.  This would mean adding
the header <?xml version='1.0' encoding='ISO-8859-1'?>, a root-level tag
(in my example I used <ARTICLES></ARTICLES>), and maybe the DTD this XML
is written for.  This may not be an option, but luckily TIMTOWTDI.  If
the file is small you could do this:

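my $xmlstr;
{
local ($/) = undef; #protect the rest of the program by making
                   #this local
open(FH, "<", $filename) or die "Can't open $filename: $!";
#join the pieces; a bare comma list in a scalar assignment
#would store only the first string
$xmlstr = join '',
         "<?xml version='1.0' encoding='ISO-8859-1'?>\n",
         "<ARTICLES>\n",
         <FH>,
         "</ARTICLES>";
close FH;
} # $/ goes out of scope and returns to normal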

and then change the call of parsefile() to simply parse(), passing it
$xmlstr instead of the filename.
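
Here is the same idea as a self-contained sketch you can run without the
real data.  The temp file and the wrap_articles() helper are made up for
this example; it also assumes the incoming file does not already carry its
own <?xml ...?> declaration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Stand-in for the real input: two articles and no root element.
my ($tmp, $filename) = tempfile(UNLINK => 1);
print {$tmp} "<ART><ORD>first</ORD></ART>\n<ART><ORD>second</ORD></ART>\n";
close $tmp or die "Can't close $filename: $!";

# Hypothetical helper: slurp the file and wrap it in a declaration
# plus an <ARTICLES> root element.
sub wrap_articles {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't open $path: $!";
    local $/ = undef;    # slurp mode, restored when the sub returns
    my $body = <$fh>;
    close $fh;
    return "<?xml version='1.0' encoding='ISO-8859-1'?>\n"
         . "<ARTICLES>\n"
         . $body
         . "</ARTICLES>";
}

my $xmlstr = wrap_articles($filename);
print $xmlstr, "\n";
```

The resulting string is what you would hand to parse().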

There are other ways, but since you are going to have the entire
structure in memory anyway after you parse the data I don't see a
problem with this method for this application.

If the file were truly large you would not want to use
XML::SimpleObject.  In that case you would want to use one of the
stream-based parsers.
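
As a rough idea of what the stream-based approach looks like, here is a
sketch using XML::Parser's handler interface (it assumes XML::Parser is
installed).  It collects only the <HIT> text and never builds a tree; the
sample data and variable names are made up, and for a real file you would
call parsefile() instead of parse().

```perl
use strict;
use warnings;
use XML::Parser;

my $xml = <<'XML';
<?xml version='1.0' encoding='ISO-8859-1'?>
<ARTICLES>
<ART><ORD>first</ORD><INL>text with a <HIT>hit one</HIT> in it.</INL></ART>
<ART><ORD>second</ORD><INL>text with a <HIT>hit two</HIT> in it.</INL></ART>
</ARTICLES>
XML

my @hits;           # collected <HIT> contents
my $in_hit = 0;     # true while we are inside a <HIT> element
my $buffer = '';    # text seen so far inside the current <HIT>

my $parser = XML::Parser->new(
    Handlers => {
        # Called for every opening tag.
        Start => sub {
            my ($expat, $tag) = @_;
            if ($tag eq 'HIT') { $in_hit = 1; $buffer = ''; }
        },
        # Called for text; may fire several times per element.
        Char => sub {
            my ($expat, $text) = @_;
            $buffer .= $text if $in_hit;
        },
        # Called for every closing tag.
        End => sub {
            my ($expat, $tag) = @_;
            if ($tag eq 'HIT') { push @hits, $buffer; $in_hit = 0; }
        },
    },
);

$parser->parse($xml);
print "$_\n" for @hits;
```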

> 
> 
> 
> Thanks for good support and help.
> 
> 
<snip />
--
Today is Boomtime, the 26th day of Confusion in the YOLD 3167
Umlaut Zebra über alles!