Re: utf8, japanese, web-pages, the horror, the horror...

Nick Ing-Simmons Tue, 11 May 2004 01:05:31 -0700

Marco Baroni <[EMAIL PROTECTED]> writes:
>Thanks for your advice... the output does look different, this time, 
>but it still doesn't look like utf8... (I get the same error with 
>recode).

>
>If somebody could suggest a way to convert to another encoding, or a 
>better way to identify the encoding of each page, that would also be 
>fine (once I have control over the encodings, I think I can find some 
>way to convert back to utf8, eg, via recode).

In my opinion Encode's from_to isn't a natural interface.
(With from_to neither the original nor the result is in a form 
in which you can use perl's character semantics.)
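
A minimal sketch of that difference, assuming a two-character Japanese
sample string in EUC-JP (the sample text and variable names are made up
for illustration and are not from Marco's script):

```perl
use strict;
use warnings;
use Encode qw(decode encode from_to);

# Build sample EUC-JP octets for a two-character string (U+65E5 U+672C).
# This stands in for page content fetched from the web.
my $octets = encode('euc-jp', "\x{65E5}\x{672C}");

# from_to() rewrites the octets in place: EUC-JP bytes become UTF-8 bytes.
# The result is still a byte string, so length() counts octets.
my $bytes = $octets;
from_to($bytes, 'euc-jp', 'utf8');
my $byte_len = length $bytes;

# decode() returns a character string, so length(), regexes and friends
# work on characters rather than octets.
my $chars    = decode('euc-jp', $octets);
my $char_len = length $chars;

printf "from_to result: length %d, decode result: length %d\n",
    $byte_len, $char_len;
```

After from_to() the string still has to be treated as raw bytes, while
the decoded string behaves like ordinary text.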

It is much better IMHO to use ->decode directly.

That is, use 'decode' to convert whatever encoding the source is in 
(based on 'charset=' in this case) to Unicode. Then write the Unicode 
out using binmode with :utf8 or the :encoding() layer of your choice.

If you must use from_to() then the appropriate target for a :utf8 stream
is to get the characters into internal Unicode form:

   from_to($text, $charset, 'Unicode') 

I would prefer to use 

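   use Encode qw(find_encoding);

   binmode STDOUT, ":utf8";                  # characters are encoded as UTF-8 on output
   my $encoding = find_encoding($charset);   # Encode object for the page's charset
   my $unicode = $encoding->decode($text);   # octets -> Perl characters
   print $unicode;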
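
The double conversion Ed describes below can be seen with a rough sketch
like the following. It assumes an in-memory filehandle in place of STDOUT
so the written octets can be counted; written_octets() is a hypothetical
helper and the sample text is made up, neither comes from Marco's script.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: print a string through a :utf8 layer into an
# in-memory buffer and return the octets that were actually written.
sub written_octets {
    my ($string) = @_;
    my $buffer = '';
    open(my $fh, '>:utf8', \$buffer)
        or die "cannot open in-memory handle: $!";
    print {$fh} $string;
    close($fh);
    return $buffer;
}

# Stand-in for a fetched page: two Japanese characters as EUC-JP octets.
my $page_octets = encode('euc-jp', "\x{65E5}\x{672C}");

# Case 1: the octets were already converted to UTF-8 (as from_to does),
# so the :utf8 layer encodes each of those bytes a second time.
my $utf8_octets = encode('UTF-8', decode('euc-jp', $page_octets));
my $double      = written_octets($utf8_octets);

# Case 2: decoded characters, encoded exactly once by the layer.
my $single = written_octets(decode('euc-jp', $page_octets));

printf "already-UTF-8 input: %d octets written\n", length $double;
printf "decoded input:       %d octets written\n", length $single;
```

Only the second case writes the same octets as a correct UTF-8 encoding
of the text, which is why decoding first and letting the layer do the
encoding is the safer combination.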

>
>Thanks again,
>
>Marco
>
>On Saturday, May 8, 2004, at 05:16 Europe/Rome, Edward Batutis wrote:
>
>> Marco:
>>
>> I think you are converting twice:
>>
>>> # output will be utf8
>>> binmode(STDOUT, ":utf8");
>>> ...
>>>                 from_to($html_text,$charset,"utf8");
>>> ...
>>
>> Here, it will convert html_text to utf-8 again because of binmode with
>> utf-8:
>>
>>>                 print "CURRENT URL $url\n$html_text\n";
>>
>> I think you can just remove the binmode line and it will work.
>>
>>> Why do encodings always cause so much pain?
>>
>> I hope this helps today's pain, at least :-).
>>
>> Regards,
>>
>> =Ed
>>
