php-i18n Digest 24 Feb 2004 09:58:23 -0000 Issue 216

Topics (messages 667 through 673):

Re: gettext and utf8
        667 by: walter fan
        668 by: Moriyoshi Koizumi
        670 by: walter fan

messages.po HTML encoding
        669 by: a.h.s. boy
        671 by: Moriyoshi Koizumi
        672 by: Moriyoshi Koizumi

mb_strcoll
        673 by: Brodie Thiesfield

Administrivia:

To subscribe to the digest, e-mail:
        [EMAIL PROTECTED]

To unsubscribe from the digest, e-mail:
        [EMAIL PROTECTED]

To post to the list, e-mail:
        [EMAIL PROTECTED]


----------------------------------------------------------------------
--- Begin Message ---
Hi, Moriyoshi

Thanks for your information. There is a strange thing: when I changed the Content-Type
line in the po file to charset=UTF-8, the UTF-8 strings were not displayed. But when I
changed it to charset=CHARSET, it worked. My server is RedHat Linux7.2.

Btw, why are the translated strings still displayed by gettext after I deleted the po and
mo files in the locale folder?

Thanks & Regards,
Walter Fan
----- Original Message ----- 
From: "Moriyoshi Koizumi" <[EMAIL PROTECTED]>
To: "walter fan" <[EMAIL PROTECTED]>
Sent: Tuesday, February 10, 2004 3:37 AM
Subject: Re: [PHP-I18N] gettext and utf8


> On 2004/02/05, at 12:42, walter fan wrote:
> 
> > 2.I modified the charset of php files and messages.po to UTF-8
> >
> > $iconv -f gb2312 -t utf-8 messages_gb2312.po>messages.po
> > $msgfmt messages.po
> >
> > But the page didn't display UTF-8 strings.
> >
> 
> Be sure to specify "Content-Type" in your po file.
> 
> ex.
> "Content-Type: text/plain; charset=gb2312"
> "Content-Type: text/plain; charset=UTF-8"
> 
> Moriyoshi
> 
> 

--- End Message ---
--- Begin Message ---
On 2004/02/10, at 9:39, walter fan wrote:


> Hi,Moriyoshi
>
> Thanks for your information. There is a strange thing, I had modified
> it to charset=UTF-8, it can not display UTF-8 strings. But I modified
> it to charset=CHARSET, it's okay. My server is RedHat Linux7.2.

Perhaps you are not sending the correct HTTP header to the browser.

Just add header('Content-Type: text/html; charset=UTF-8');
at the top of your script.

Also try bind_textdomain_codeset() to specify the charset in which
gettext should return the translated texts.
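
A minimal sketch of these two suggestions, written for a current PHP release rather than the PHP 4 of this thread. The text domain `messages`, the `./locale` directory, the `zh_CN` locale, and the helper name `setup_utf8_gettext` are all assumptions made for illustration; none of them come from the original posts.

```php
<?php
// Hypothetical helper: the name and its parameters are illustrative only.
// Returns true when the requested locale could be selected; returns false
// when the gettext extension is missing or the locale is not installed.
function setup_utf8_gettext(string $domain, string $localeDir, string $locale): bool
{
    if (!function_exists('bindtextdomain')) {
        return false; // gettext extension is not loaded
    }

    // Tell the browser that the page body is UTF-8 (only possible before output).
    if (!headers_sent()) {
        header('Content-Type: text/html; charset=UTF-8');
    }

    // Try the UTF-8 variant of the locale first, then the plain name.
    $selected = setlocale(LC_ALL, $locale . '.UTF-8', $locale);

    // gettext looks for <localeDir>/<locale>/LC_MESSAGES/<domain>.mo.
    bindtextdomain($domain, $localeDir);

    // Ask gettext to return every message of this domain encoded as UTF-8,
    // whatever charset the catalog itself declares.
    bind_textdomain_codeset($domain, 'UTF-8');
    textdomain($domain);

    return $selected !== false;
}

// Assumed layout: ./locale/zh_CN/LC_MESSAGES/messages.mo
setup_utf8_gettext('messages', __DIR__ . '/locale', 'zh_CN');
echo function_exists('gettext') ? gettext('Hello') : 'Hello', "\n";
```

If the catalog cannot be found, gettext() simply returns the untranslated string, so a missing locale or a wrong base directory shows up as English text rather than as an error.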


> Btw, why the translation strings still display by gettext after I
> deleted the po and mo files in locale folder?

Do you mean the translated strings continue to appear after the message catalogs are removed?
Maybe you have two or more base directories, and the one you pass to bindtextdomain()
is not the one that held the removed catalogs.


Moriyoshi

> Thanks & Regards,
> Walter Fan
> ----- Original Message -----
> From: "Moriyoshi Koizumi" <[EMAIL PROTECTED]>
> To: "walter fan" <[EMAIL PROTECTED]>
> Sent: Tuesday, February 10, 2004 3:37 AM
> Subject: Re: [PHP-I18N] gettext and utf8
>
>
>> On 2004/02/05, at 12:42, walter fan wrote:
>>
>>> 2.I modified the charset of php files and messages.po to UTF-8
>>>
>>> $iconv -f gb2312 -t utf-8 messages_gb2312.po>messages.po
>>> $msgfmt messages.po
>>>
>>> But the page didn't display UTF-8 strings.
>>>
>>
>> Be sure to specify "Content-Type" in your po file.
>>
>> ex.
>> "Content-Type: text/plain; charset=gb2312"
>> "Content-Type: text/plain; charset=UTF-8"
>>
>> Moriyoshi

--- End Message ---
--- Begin Message ---
Moriyoshi, thanks again.

I tried your method, but my problem still exists. Although I set the
charset using header('Content-Type: text/html; charset=UTF-8'),
the Chinese strings on the page are still displayed as
garbage.

> Do you mean translated strings continue to appear if the message
> catalogs are removed? [Walter: Yes]
> Maybe you have two or more base directory and you are specifying
> another base to bindtextdomain()
> which isn't involved with the removed catalogs.

I only have one directory on my server and call bindtextdomain
only once.

I think it may be caused by the server (Redhat
Linux7.3+Apache1.3+PHP4.2.2+gettext0.13), but I haven't found the real reason.
Should I change some configuration on the server?

Thanks & Regards,
Walter Fan

"Moriyoshi Koizumi" <[EMAIL PROTECTED]>
>
> On 2004/02/10, at 9:39, walter fan wrote:
>
> > Hi,Moriyoshi
> >
> > Thanks for your information. There is a strange thing, I had modified
> > it to charset=UTF-8, it can not display UTF-8 strings. But I modified
> > it to charset=CHARSET, it's okay. My server is RedHat Linux7.2.
>
> Perhaps you don't issue a correct HTTP header to the browser.
>
> Just add header('Content-Type: text/html; charset=UTF-8')
> at the top of your script.
>
> And try using bind_textdomain_codeset() to specify the output charset,
> which you
> want to send the texts as.
>
> > Btw, why the translation strings still display by gettext  after I
> > deleted the po and mo files in locale folder?
>
> Do you mean translated strings continue to appear if the message
> catalogs are removed?
> Maybe you have two or more base directory and you are specifying
> another base to bindtextdomain()
> which isn't involved with the removed catalogs.
>
> Moriyoshi
>
> > Thanks & Regards,
> > Walter Fan
> > ----- Original Message -----
> > From: "Moriyoshi Koizumi" <[EMAIL PROTECTED]>
> > To: "walter fan" <[EMAIL PROTECTED]>
> > Sent: Tuesday, February 10, 2004 3:37 AM
> > Subject: Re: [PHP-I18N] gettext and utf8
> >
> >
> >> On 2004/02/05, at 12:42, walter fan wrote:
> >>
> >>> 2.I modified the charset of php files and messages.po to UTF-8
> >>>
> >>> $iconv -f gb2312 -t utf-8 messages_gb2312.po>messages.po
> >>> $msgfmt messages.po
> >>>
> >>> But the page didn't display UTF-8 strings.
> >>>
> >>
> >> Be sure to specify "Content-Type" in your po file.
> >>
> >> ex.
> >> "Content-Type: text/plain; charset=gb2312"
> >> "Content-Type: text/plain; charset=UTF-8"
> >>
> >> Moriyoshi
> >>
> >>
> >
> >

--- End Message ---
--- Begin Message ---
I have had a functional gettext-based internationalized content management system for a while now. A number of translators have offered their support, and I have localization files for Swedish, Norwegian, Chinese, Arabic, Turkish, Japanese, Spanish, etc.

The PHP software system is utf-8 based, so character sets haven't been an issue. Indeed, everything's been working quite well, but I just noticed a procedural item that made me wonder what the best approach is.

When non-roman language translators (japanese, arabic, chinese) send me their messages.po files, I open and save them as "utf-8 (no BOM)" files to preserve their integrity. (I use BBEdit on Mac OS X, which handles this nicely).

When using Spanish, Swedish, etc files, however, many of the translators have converted the text strings to HTML entities, e.g. "espa&ntilde;ol". In one way, this makes sense, since they are to be displayed on a web page. But is it the right thing to do? Or should such strings be in messages.po with all their accents, and converted with htmlspecialchars() before output?

The issue cropped up because I'm converting the site to XHTML 1.1 output, and that means encoding things like ampersands. I have functions for creating drop-down menus (e.g. "categories" and "languages"). If a menu has an item like a "Crime & Punishment" category, I'd want to convert it to "Crime &amp; Punishment" for XHTML compliance. But I don't want the language menu to RE-encode "espa&ntilde;ol" as "espa&amp;ntilde;ol", which would screw everything up.

So what's the best way to handle the relationship between HTML entities and gettext-based messages.po files?

In fact, the larger question is: do HTML entities really need to be entity-ized on utf-8 pages, whose character set actually should be capable of displaying the characters? Obviously "htmlspecialchars()" handles characters that cause output problems (like < and >, which indicate tag opening/closing), but for a utf-8 based system, "n tilde" doesn't need to be encoded at all, does it?

It seems like early HTML education would state categorically that "español" needs to be written as "espa&ntilde;ol" on a web page, but that isn't really true for utf-8 pages, is it?

spud.

-------------------------------------------------------------------
a.h.s. boy
spud(at)nothingness.org            "as yes is to if,love is to yes"
http://www.nothingness.org/
-------------------------------------------------------------------

--- End Message ---
--- Begin Message ---
On 2004/02/11, at 1:45, a.h.s. boy wrote:

> When using Spanish, Swedish, etc files, however, many of the translators have converted the text strings to HTML entities, e.g. "espa&ntilde;ol". In one way, this makes sense, since they are to be displayed on a web page. But is it the right thing to do? Or should such strings be in messages.po with all their accents, and converted with htmlspecialchars() before output?

Yes, I think the strings should go into messages.po with their accents. It is not a
good idea to store accented characters as entities in the .po file, because entities
only make sense when gettext is used in conjunction with HTML / XML. Besides, you won't
need to convert such strings into entities as long as you choose UTF-8 as the output charset.


> The issue cropped up because I'm converting the site to XHTML 1.1 output, and that means encoding things like ampersands. I have functions for creating drop-down menus (e.g. "categories" and "languages"). If a menu has an item like a "Crime & Punishment" category, I'd want to convert it to "Crime &amp; Punishment" for XHTML compliance. But I don't want the language menu to RE-encode "espa&ntilde;ol" as "espa&amp;ntilde;ol", which would screw everything up.
>
> So what's the best way to handle the relationship between HTML entities and gettext-based messages.po files?
>
> In fact, the larger question is: do HTML entities really need to be entity-ized on utf-8 pages, whose character set actually should be capable of displaying the characters? Obviously "htmlspecialchars()" handles characters that cause output problems (like < and >, which indicate tag opening/closing), but for a utf-8 based system, "n tilde" doesn't need to be encoded at all, does it?
>
> It seems like early HTML education would state categorically that "español" needs to be written as "espa&ntilde;ol" on a web page, but that isn't really true for utf-8 pages, is it?
>
> spud.
>
> -------------------------------------------------------------------
> a.h.s. boy
> spud(at)nothingness.org            "as yes is to if,love is to yes"
> http://www.nothingness.org/
> -------------------------------------------------------------------

--
PHP Internationalization Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php




--- End Message ---
--- Begin Message ---
I just clicked the "send" button too early. Please ignore the previous message, sorry :)

On 2004/02/11, at 1:45, a.h.s. boy wrote:

> When using Spanish, Swedish, etc files, however, many of the translators have converted the text strings to HTML entities, e.g. "espa&ntilde;ol". In one way, this makes sense, since they are to be displayed on a web page. But is it the right thing to do? Or should such strings be in messages.po with all their accents, and converted with htmlspecialchars() before output?

Yes, I think the strings should go into messages.po with their accents. It is not a
good idea to store accented characters as entities in the .po file, because entities
only make sense when gettext is used in conjunction with HTML / XML. Besides, you won't
need to convert such strings into entities as long as you choose UTF-8 as the output charset.


> In fact, the larger question is: do HTML entities really need to be entity-ized on utf-8 pages, whose character set actually should be capable of displaying the characters? Obviously "htmlspecialchars()" handles characters that cause output problems (like < and >, which indicate tag opening/closing), but for a utf-8 based system, "n tilde" doesn't need to be encoded at all, does it?

They don't have to be converted to entities. The core idea behind HTML entities is to
represent, in a document written in a legacy character set, characters that are not
available in every character set. UTF-8 was developed to resolve such issues.
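
To illustrate the point, here is a minimal sketch assuming the page is sent as UTF-8 and the source file itself is saved as UTF-8. The helper name `escape_for_xhtml` is made up for this example; `htmlspecialchars()` with `ENT_QUOTES` and an explicit `'UTF-8'` charset is the only real API involved.

```php
<?php
// Hypothetical helper: escapes a catalog string for XHTML output.
// Only the markup-significant characters (&, <, >, and both quote
// characters) are converted; accented UTF-8 letters pass through unchanged.
function escape_for_xhtml(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

echo escape_for_xhtml('Crime & Punishment'), "\n"; // Crime &amp; Punishment
echo escape_for_xhtml('español'), "\n";            // español

// A string that already contains an entity is escaped a second time,
// which is the re-encoding problem described in the question.
echo escape_for_xhtml('espa&ntilde;ol'), "\n";     // espa&amp;ntilde;ol
```

Keeping the raw characters in messages.po and escaping once, at output time, avoids having to know which strings were already entity-encoded by a translator.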


Moriyoshi
--- End Message ---
--- Begin Message ---
Hi,

Are there any plans to make a multibyte version of strcoll? Or at least a version which supports utf-8 and uses the Unicode collation algorithm/tables (http://www.unicode.org/reports/tr10/)?

Regards,
Brodie

--- End Message ---
