Re: utf8, japanese, web-pages, the horror, the horror...

Marco Baroni Tue, 11 May 2004 01:13:53 -0700

Thanks!

I will try the solution you propose, and I will let y'all know whether it 
works.


In the meantime, I had ``solved'' the problem by saving pages with
different charset=... declarations to different output files (ofile.sjis,
ofile.euc, etc.), and then using recode to convert everything to the same
charset.

Unfortunately, this (moving the encoding processing outside perl) seems
to be what I always end up doing, when I have to deal with characters
outside the latin1 range...

As you said, from_to isn't a natural interface, at least for me!

Regards,

Marco
 
> In my opinion Encode's from_to isn't a natural interface.
> (With from_to neither the original nor the result is in a form 
> in which you can use perl's character semantics.)
> 
> It is much better IMHO to use ->decode directly.
> 
> That is, use 'decode' to convert (based on 'charset=' in this case) 
> whatever encoding the source is in to Unicode. Then write Unicode using 
> binmode :utf8 or :encoding() of your choice.
> 
> If you must use from_to() then the appropriate target for a :utf8 stream
> is to get characters into internal Unicode form:
> 
>    from_to($text, $charset, 'Unicode') 
> 
> I would prefer to use 
> 
>    binmode STDOUT,":utf8";
>    my $encoding = find_encoding($charset);
>    my $unicode = $encoding->decode($text);
>    print $unicode;
> 
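
For reference, here is a minimal sketch of the decode-based approach quoted
above, set against from_to. The helper name decode_page and the sample byte
strings (the two characters U+65E5 U+672C in Shift_JIS and in EUC-JP) are
assumptions made for illustration, not code from the actual crawler.

```perl
use strict;
use warnings;
use Encode qw(find_encoding from_to);

# decode_page is a hypothetical helper (not from the thread): it decodes raw
# octets according to the charset named in a page's charset=... declaration.
sub decode_page {
    my ($octets, $charset) = @_;
    my $encoding = find_encoding($charset)
        or die "Unknown charset: $charset\n";
    # decode() turns octets into a Perl character string.
    return $encoding->decode($octets);
}

# The same two characters (U+65E5 U+672C) as raw octets in two encodings.
my $sjis = "\x93\xFA\x96\x7B";    # Shift_JIS
my $euc  = "\xC6\xFC\xCB\xDC";    # EUC-JP

my $from_sjis = decode_page($sjis, 'shiftjis');
my $from_euc  = decode_page($euc,  'euc-jp');

# For contrast: from_to rewrites the octets in place, so length() on the
# result still counts octets rather than characters.
my $copy = $sjis;
from_to($copy, 'shiftjis', 'utf8');
my $octet_count = length $copy;

# Both decoded strings hold the same characters, whatever the source charset,
# and the :utf8 layer encodes them on output.
binmode STDOUT, ':utf8';
print "$from_sjis\n" if $from_sjis eq $from_euc;
```

After decoding, length and regexes work on characters, which is the point
made above about from_to: its result is still a string of octets.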

--
Marco Baroni
SSLMIT, University of Bologna
http://sslmit.unibo.it/~baroni
