On 20 Mar 2007 at 12:55, Chas Owens wrote:

> On 3/20/07, Beginner <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > I have a large, 1.3GB xml file that I was trying to validate. It
> > turns out that the file has a lot of exotic characters in it such as:
> > é
> > è
> > Ä
> > È
> > ...etc

> > Being a lazy kinda guy, I thought I would cat the file and let perl
> > make the substitutions where it found any of these characters. My
> > problem is I am not sure how to regex for these characters except to
> > look for the hex value. Neither do I know of a way to escape/encode
> > them correctly.
> >
> > I have seen the pragma utf8 but I am not sure my problem is what this
> > pragma was designed for. Does anyone have any suggestions for a
> > module or method that might take some of the pain out of detecting
> > and escaping such characters?

>
> Be a truly lazy guy and use iconv.


The hex idea might work, if I can locate all the characters in a
file/string whose hex values fall in a given range and then substitute
them. Perhaps something like this would do it (feel free to correct me
if I am wrong):

s/(\xc9)/'&#'.ord($1).';'/e             # Hoping for &#201; from É

However it doesn't feel like it's the best approach.
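
For what it's worth, a minimal sketch of the range-based version of that substitution might look like the following. It assumes the input really is single-byte ISO-8859-1 (it would mangle multi-byte UTF-8), and escape_latin1 is just a name made up for illustration. The capture group gives the matched byte, ord() turns it into its number, and /e makes Perl evaluate the replacement as code.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every byte in the range 0x80-0xFF with a
# decimal numeric character reference. Assumes single-byte ISO-8859-1 input.
sub escape_latin1 {
    my ($text) = @_;
    # /g: every match, /e: evaluate the replacement as Perl code
    $text =~ s/([\x80-\xff])/'&#' . ord($1) . ';'/ge;
    return $text;
}

print escape_latin1("MICROSC\xd3PIO"), "\n";   # prints MICROSC&#211;PIO
```

For a 1.3GB file this would be applied line by line while reading, rather than slurping the whole file into memory.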

The Iconv route hasn't been too successful either. I tried
Text::Iconv->new('ISO8859-1','utf8');

I was thinking that my data is currently ISO8859-1, but the results were
not what I had hoped. Where I had MICROSCÓPIO, I got MICROSCÃPIO.
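
That particular result is what double encoding tends to look like: Ó in UTF-8 is the two bytes C3 93, and reading those as ISO-8859-1 gives Ã plus an invisible control character. So it may be that the data is already UTF-8. The sketch below reproduces the effect with the core Encode module and adds a strict-decode check. The sample bytes and the is_valid_utf8 helper are assumptions for illustration, not taken from the real file.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Bytes of "MICROSCÓPIO" as already-valid UTF-8 (Ó is C3 93).
my $utf8_bytes = "MICROSC\xc3\x93PIO";

# Treating those bytes as ISO-8859-1 and converting to UTF-8 again
# (the effect of a latin1 -> utf8 conversion) double-encodes the Ó.
my $double = encode('UTF-8', decode('ISO-8859-1', $utf8_bytes));

# Hypothetical helper: strict decoding croaks on malformed input,
# so eval returns true only for well-formed UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    return eval { decode('UTF-8', $bytes, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# Show the doubled bytes in hex: the C3 93 pair has become C3 83 C2 93.
print join(' ', map { sprintf '%02X', ord } split //, $double), "\n";
print is_valid_utf8($utf8_bytes) ? "already UTF-8\n" : "not UTF-8\n";
```

If a sample of the file passes a check like this, no conversion is needed and the problem is more likely a missing or wrong encoding declaration in the XML.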

If I can't convert, perhaps I need to XHTML escape them so
MICROSCÓPIO would become MICROSC&Oacute;PIO.

Is there a module that can parse and substitute in this way or do I
need to roll my own?
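
One candidate is HTML::Entities from CPAN (part of the HTML-Parser distribution, so not a core module). The sketch below assumes it is installed and that the input is an ISO-8859-1 byte string. Note that plain XML only predefines five entities (&amp;lt; &amp;gt; &amp;amp; &amp;quot; &amp;apos; in escaped form), so named ones like &Oacute; are only valid if the DTD declares them; the numeric form is the safer choice for validation.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities encode_entities_numeric);

# Sample ISO-8859-1 byte string (assumed input, Ó is byte D3).
my $word = "MICROSC\xd3PIO";

# Named entities: fine for (X)HTML, but need a DTD in plain XML.
my $named = encode_entities($word);

# Numeric references: valid in any XML document.
my $numeric = encode_entities_numeric($word);

print "$named\n";     # MICROSC&Oacute;PIO
print "$numeric\n";
```

By default both functions also escape characters such as & and <, so they should be applied to text content only, not to a whole document with its markup.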

TIA,
Dp.


