php-i18n Digest 2 Mar 2003 17:39:55 -0000 Issue 156

Topics (messages 471 through 486):

Re: detecting katakana vs hiragana
        471 by: Moriyoshi Koizumi
        473 by: Simon Dedeyne
        474 by: Moriyoshi Koizumi

Re: Chasen Questions
        472 by: Moriyoshi Koizumi

Allowing for time differences and language differences.
        475 by: Ian A. Gray
        476 by: Gary Ross

Re: Is it multi-byte safe?
        477 by: Jean-Christian Imbeault
        479 by: Moriyoshi Koizumi
        480 by: Jean-Christian Imbeault
        481 by: Jean-Christian Imbeault
        482 by: Moriyoshi Koizumi
        483 by: Jean-Christian Imbeault
        484 by: David Emery
        485 by: Jean-Christian Imbeault
        486 by: Moriyoshi Koizumi

Re: Internationalized feeding of MySQL
        478 by: Jean-Christian Imbeault

Administrivia:

To subscribe to the digest, e-mail:
        [EMAIL PROTECTED]

To unsubscribe from the digest, e-mail:
        [EMAIL PROTECTED]

To post to the list, e-mail:
        [EMAIL PROTECTED]


----------------------------------------------------------------------
--- Begin Message ---
On Fri, 28 Feb 2003 12:28:18 +0100
"Simon Dedeyne" <[EMAIL PROTECTED]> wrote:

> 
> Hi,
> 
> Is there a way/function for detecting if a word is in katakana or
> hiragana?
> I'm using UTF-8 encoding. 
> 
> I don't know if it's of any help, but I wonder if a solution could be
> found through a regular expression option, as indicated (though not
> explained) here http://regex.info/indexlist.html
> 
> Tnx,
> Simon

You can accomplish it by converting any hiragana / katakana mixture to either
hiragana or katakana characters with mb_convert_kana() and then comparing
it with the original.

Also try the mb_ereg_* functions, which allow localised characters to be used in
character classes.
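
A minimal sketch of the comparison approach described above, assuming UTF-8
input and the mbstring extension. kana_type() is a hypothetical helper name,
not a PHP built-in. Characters that are not full-width kana (kanji, Latin
letters, half-width katakana) pass through both conversions unchanged, so they
do not influence the result.

```php
<?php
// Hypothetical helper (not part of PHP): classify the kana found in $word.
function kana_type($word, $encoding = 'UTF-8')
{
    // "c" converts full-width katakana to hiragana; "C" does the reverse.
    $asHiragana = mb_convert_kana($word, 'c', $encoding);
    $asKatakana = mb_convert_kana($word, 'C', $encoding);

    if ($asHiragana === $asKatakana) {
        return 'none';      // neither conversion changed anything: no convertible kana
    }
    if ($word === $asHiragana) {
        return 'hiragana';  // converting to hiragana left the word untouched
    }
    if ($word === $asKatakana) {
        return 'katakana';  // converting to katakana left the word untouched
    }
    return 'mixed';         // the word contains both kinds of kana
}

echo kana_type("わたし"), "\n";
echo kana_type("ワタシ"), "\n";
```

The encoding is passed explicitly on every call so the helper does not depend
on the mbstring.internal_encoding setting.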

Moriyoshi

--- End Message ---
--- Begin Message ---
I haven't tried the mb_ereg solution yet, but I thought of doing something with
mb_convert_kana(). Here's a little example script. I tried it, but nothing gets
converted at all! (I have the mb functions enabled and am using PHP 4.3.0 on a
Windows XP system.) Why doesn't this work?

Simon

<html>
<head>
<meta http-equiv="Content-Type" content="Text/Html; Charset=UTF-8">
<title>hira2kata</title>
</head>
<body>
<?php
$str="わたしわ";
// watashi 
echo $str."<br>";
$str = mb_convert_kana($str, "c");
echo $str."<br>"; 
$str = mb_convert_kana($str, "C");
echo $str; 
?>
</body>
</html>






--- End Message ---
--- Begin Message ---
On Fri, 28 Feb 2003 14:43:38 +0100
"Simon Dedeyne" <[EMAIL PROTECTED]> wrote:

> I haven't tried the mb_ereg solution yet, but I thought of doing something
> with mb_convert_kana(). Here's a little example script. I tried it, but
> nothing gets converted at all! (I have the mb functions enabled and am using
> PHP 4.3.0 on a Windows XP system.) Why doesn't this work?

Are you sure that you set mbstring.internal_encoding to the right encoding (UTF-8)?
Alternatively, pass the encoding as the third parameter:
mb_convert_kana($str, "C", "UTF-8").
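
The two fixes suggested above could look roughly like this. It is only a
sketch, assuming the script file itself is saved as UTF-8 and the mbstring
extension is loaded; the variable names are illustrative.

```php
<?php
$str = "わたしわ";  // hiragana; the source file is assumed to be UTF-8

// Option 1: name the encoding on each call.
$katakana = mb_convert_kana($str, "C", "UTF-8");      // hiragana -> katakana
$hiragana = mb_convert_kana($katakana, "c", "UTF-8"); // katakana -> hiragana

// Option 2: set the internal encoding once; later calls default to it.
mb_internal_encoding("UTF-8");
$katakanaAgain = mb_convert_kana($str, "C");

echo $katakana, "\n", $hiragana, "\n";
```

Note that "c" only turns katakana into hiragana, so applying it to a string
that is already hiragana leaves the string unchanged.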

Moriyoshi

--- End Message ---
--- Begin Message ---
Hi,

On Thu, 27 Feb 2003 00:18:36 +0900
Gary Ross <[EMAIL PROTECTED]> wrote:

> I have chasen working fine as of 2.3.0 (with darts 0.1 installed)
> 
> 1.
> It works fine with the
> SUFDIC and PATDIC
> but I get this error if I try to use DADIC
>
> I couldn't find any information regarding how to actually install 
> chadic.lex

That seems to be because of some errors in the package.

> 2. Which is better to use SUFDIC or PATDIC?
> Speed is not really an issue as I'm on an extremely fast dedicated 
> server.
> I don't really understand the differences.

It depends on how long the text you are parsing is.

According to the documentation:

SUFDIC => fast on startup, but slow on word lookup
PATDIC => slow on startup, but fast on word lookup

That is, when you parse a bunch of short texts over multiple sessions,
SUFDIC is the better choice. When you parse long texts in a single
session, PATDIC makes more sense.
 
> 3. How do I add words to the dictionary in a *nix environment?
> The faq on the home page has this question then a blank space !

Anyway, this may be the wrong place for those questions because they have
nothing to do with PHP itself. As it's unlikely that you'll get
many responses from Chasen experts here, I think it'd be better to
contact the author directly.

Moriyoshi

--- End Message ---
--- Begin Message ---
Hi.  I am quite new to php.  I have a website which is hosted (using a hosting 
company) on linux servers and uses PHP 4.1.2 (wish they would upgrade it!)

I have a few questions:

1) I want to have a PHP script which I can then call with the include() command from a 
few of my web pages that will greet you with Good morning, Good afternoon or Good 
evening.  I know there are many examples of these to be found.  But don't these only 
work if the person viewing the pages lives in the same time-zone as my server?  Is 
there a way of displaying the correct greeting for the viewer's particular time-zone?

2) I would like to show the correct date (again so that it will be the correct date 
for the viewer's time-zone) but will also output in the correct way for either US or UK 
viewers.  For example if a person looks at my site in the UK they may see: 28th 
February 2003 and in the US they would see February 28, 2003 (by the way, is that the 
correct way of putting the date for the US?)

3) If it is indeed possible for PHP to differentiate between someone from the US, and 
the UK (or indeed other countries) can it then output different spellings?  For 
example if part of my website had some text:

"A man with a grey-coloured top walked along the pavement and then onto the road. He 
saw a car in the distance with a purple boot and a blue bonnet."

would it be possible for PHP to convert it for an American viewer:

"A man with a gray-colored top walked along the sidewalk and then onto the pavement.  
He saw an automobile in the distance with a purple trunk and a blue hood."

Excuse me if I translated it into American English wrongly!

I would be grateful if people could come up with some solutions for these things.

Regards,

Ian Gray



---------------------------------
Ian A. Gray
Manchester, UK
Telephone: +44 (0) 161 224 1635 - Fax: +44 (0) 870 135 0061 - Mobile: +44 (0) 7900 996 
328
US Fax no.:  707-885-3582
E-mail: [EMAIL PROTECTED] - Websites: www.baritone.uk.com (performance) & 
www.vocalstudio.co.uk (Tuition)
---------------------------------





--- End Message ---
--- Begin Message ---
Hello,


> 1) I want to have a PHP script which I can then call with the include() command
> from a few of my web pages that will greet you with Good morning, Good
> afternoon or Good evening. I know there are many examples of these to be
> found. But don't these only work if the person viewing the pages lives in
> the same time-zone as my server? Is there a way of displaying the correct
> greeting for the viewer's particular time-zone?

First you'd have to work out the client's time-zone from their IP address. This is
not as easy as it seems, but I think there are scripts that would do the trick. Try
hotscripts.com or something similar. Once you know the zone, the built-in time(),
date() and gmdate() functions can help you get the viewer's actual local time.
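
As a rough sketch of the last step only: once the viewer's offset from UTC is
known, the greeting can be chosen with gmdate(). How the offset is obtained
(IP lookup, a user preference, a client-side script) is left open here, and
greeting_for_offset() and the hour boundaries are assumptions made for
illustration.

```php
<?php
// Hypothetical helper: $utcOffsetHours is the viewer's offset from UTC,
// e.g. 0 for the UK in winter, -5 for New York in winter, 9 for Tokyo.
function greeting_for_offset($utcOffsetHours, $now = null)
{
    if ($now === null) {
        $now = time();
    }
    // gmdate() formats a timestamp as UTC, so shifting the timestamp by the
    // offset first yields the hour on the viewer's clock (0-23).
    $hour = (int) gmdate('G', $now + (int) round($utcOffsetHours * 3600));

    if ($hour < 12) {
        return 'Good morning';
    }
    if ($hour < 18) {
        return 'Good afternoon';
    }
    return 'Good evening';
}

echo greeting_for_offset(0), "\n";
```

Because the calculation never uses the server's local time, the result is the
same wherever the server is hosted.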




> 2) I would like to show the correct date (again so that it will be the correct
> date for the viewer's time-zone) but will also output in the correct way for
> either US or UK viewers. For example if a person looks at my site in the UK
> they may see: 28th February 2003 and in the US they would see February 28,
> 2003 (by the way, is that the correct way of putting the date for the US?)
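
For the two date styles mentioned in question 2, a small sketch with gmdate()
format strings follows. The locale codes and the format_date_for() helper are
assumptions for illustration; deciding which locale a visitor belongs to is a
separate problem that this does not solve.

```php
<?php
// Hypothetical helper: pick a date format from a locale code.
function format_date_for($locale, $timestamp)
{
    $formats = array(
        'en_GB' => 'jS F Y',  // day with ordinal suffix, month name, year
        'en_US' => 'F j, Y',  // month name, day, comma, year
    );
    // Fall back to an unambiguous ISO-style date for unknown locales.
    $format = isset($formats[$locale]) ? $formats[$locale] : 'Y-m-d';

    // gmdate() formats in UTC, so the output does not depend on the server's zone.
    return gmdate($format, $timestamp);
}

$when = gmmktime(12, 0, 0, 2, 28, 2003);
echo format_date_for('en_GB', $when), "\n";
echo format_date_for('en_US', $when), "\n";
```

To show the date in the viewer's own zone, the timestamp would first need to
be shifted by the viewer's UTC offset.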


> 3) If it is indeed possible for PHP to differentiate between someone from the
> US, and the UK (or indeed other countries) can it then output different
> spellings? For example if part of my website had some text:
>
> "A man with a grey-coloured top walked along the pavement and then onto the
> road. He saw a car in the distance with a purple boot and a blue bonnet."
>
> would it be possible for PHP to convert it for an American viewer:
>
> "A man with a gray-colored top walked along the sidewalk and then onto the
> pavement. He saw an automobile in the distance with a purple trunk and a blue
> hood."

I think what you ask is essentially impossible, or would require some high-level
'translation' software to do the trick. This is because of the contextual
difficulties of word conversion, which we handle without thinking but computers are
notoriously bad at.


If you just try changing 'boot' to 'trunk' then many times the results aren't what you expected:
I put the suitcase in the boot. --> I put the suitcase in the trunk.
I was given the boot (by my company)--> I was given the trunk ???
I bought a new pair of boots --> I bought a new pair of trunks (different meaning completely)


You may do better with the basic spelling issues (changing colour --> color etc.) using str_replace():

$text = str_replace('colour', 'color', $text);

Make a list and put it in an array and loop around the str_replace function.
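
A sketch of the list-based idea, with an illustrative word list. str_replace()
also accepts arrays for the search and replacement arguments, so the loop can
be left to PHP. The americanize() helper and the map contents are assumptions,
not a complete dictionary.

```php
<?php
// Illustrative UK -> US spelling map; a real list would be much longer.
$ukToUs = array(
    'colour' => 'color',
    'grey'   => 'gray',
);

// Hypothetical helper: apply every replacement in $map to $text.
function americanize($text, $map)
{
    // With array arguments, str_replace() applies each pair in order.
    return str_replace(array_keys($map), array_values($map), $text);
}

echo americanize("A man with a grey-coloured top", $ukToUs), "\n";
```

This is a plain substring replacement: it is case-sensitive and also changes
the text inside longer words (for example "greyhound"), so it only suits
simple spelling differences.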

Gary


--- End Message ---
--- Begin Message ---
Is addslashes() multi-byte safe?

I will be using it to escape multi-byte input and wouldn't want it to mangle anything...

Thanks,

Jc


--- End Message ---
--- Begin Message ---
Jean-Christian Imbeault <[EMAIL PROTECTED]> wrote:

> Is addslashes() multi-byte safe?
> 
> I will be using it to escape multi-byte input and wouldn't want it to 
> mangle anything...

Partially yes.

Strings encoded in GB2312(CP936), big5, Shift_JIS are known to be 
clobbered by addslashes().

UTF-8, EUC-JP, EUC-KR, EUC-CN and EUC-TW are not affected.
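
A small byte-level illustration of the problem, using the katakana "ソ", whose
Shift_JIS encoding is the two bytes 0x83 0x5C. The strings are written as raw
byte escapes so the example does not depend on the encoding of the script
file; the variable names are illustrative.

```php
<?php
// "ソ" in Shift_JIS: lead byte 0x83, trail byte 0x5C (the same value as "\").
$sjis = "\x83\x5c";

// addslashes() works byte by byte, so it sees the trail byte as a backslash
// and doubles it, which corrupts the character.
$escapedSjis = addslashes($sjis);
echo bin2hex($escapedSjis), "\n";   // 835c5c

// The same character in UTF-8 is 0xE3 0x82 0xBD: no byte equals 0x5C,
// so addslashes() leaves it alone.
$utf8 = "\xe3\x82\xbd";
$escapedUtf8 = addslashes($utf8);
echo bin2hex($escapedUtf8), "\n";   // e382bd
```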

Moriyoshi


--- End Message ---
--- Begin Message ---
Moriyoshi Koizumi wrote:

> Partially yes.
>
> Strings encoded in GB2312(CP936), big5, Shift_JIS are known to be
> clobbered by addslashes().

Sh*t ... and I just added a whole bunch of addslashes() to my code to prevent SQL attacks. And of course my web pages are for Japanese ... and most of them will be using SJIS.


If I have internal_encoding set to EUC-JP, does that mean that all POST or GET vars passed in will be translated to EUC-JP and hence my addslashes() calls will be fine?

I sure hope so ...

Jc


--- End Message ---
--- Begin Message ---
Jean-Christian Imbeault <[EMAIL PROTECTED]> wrote:

> Moriyoshi Koizumi wrote:
> > 
> > Partially yes.
> > 
> > Strings encoded in GB2312(CP936), big5, Shift_JIS are known to be 
> > clobbered by addslashes().
> 
> Sh*t ... and I just added a whole bunch of addslashes() to my code to 
> prevent SQL attacks. And of course my web pages are for Japanese ... and 
> most of them will be using SJIS.
> 
> If I have internal_encoding set to EUC-JP does that mean that all POST 
> or GET vars passed in will be translated to EUC-JP and hence my 
> addslashes will be fine?

That's the case as long as the browser actually sends the form contents as
EUC-JP encoded strings and no automagical encoding conversion is performed
by the mbstring module (I mean output_handler=mb_output_handler in the ini
settings). You then have to make sure the page contents are encoded in
EUC-JP.

But it's quite likely that clients send form contents in UTF-8 when the GET
method is used.

Moriyoshi


--- End Message ---
--- Begin Message ---
Moriyoshi Koizumi wrote:

> That's the case as long as the browser precisely sends form contents as
> EUC-JP encoded strings and no automagical encoding conversion is performed
> there by mbstring module (I mean output_handler=mb_output_handler in ini
> settings). Then you have to prepare the page contents to be encoded in
> EUC-JP.

Ok, no output_handler=mb_output_handler in my php.ini :) I am using the recommended php.ini (not the default) and it has these settings by default:


;mbstring.language = Japanese
;mbstring.internal_encoding = EUC-JP
;mbstring.http_input = auto
;mbstring.http_output = SJIS
;mbstring.encoding_translation = Off

I see that mbstring.encoding_translation = Off ... and a phpinfo() shows that I have compiled with:

--enable-mbstring-enc-trans

But it also says:
                               Local Global
mbstring.encoding_translation   Off   Off

I had assumed that my compile-time option would override the php.ini setting, but I guess I was wrong?

If I want all incoming data to be translated to internal_encoding, I guess I need to change this php.ini setting?

> But it's very probable that clients send form contents in UTF-8 when GET
> method is used..

In which case I am safe :) But then again anyone who would want to try an SQL injection attack might try and send some SJIS ... better safe than sorry :)


Jc


--- End Message ---
--- Begin Message ---
On 2003.Mar.1, at 21:14 Asia/Tokyo, Jean-Christian Imbeault wrote:


> Moriyoshi Koizumi wrote:
> > That's the case as long as the browser precisely sends form contents as
> > EUC-JP encoded strings and no automagical encoding conversion is performed
> > there by mbstring module (I mean output_handler=mb_output_handler in ini
> > settings). Then you have to prepare the page contents to be encoded in
> > EUC-JP.
>
> Ok, no output_handler=mb_output_handler in my php.ini :) I am using the
> recommended php.ini (not the default) and it has these settings by default:
>
> ;mbstring.language = Japanese
> ;mbstring.internal_encoding = EUC-JP
> ;mbstring.http_input = auto
> ;mbstring.http_output = SJIS
> ;mbstring.encoding_translation = Off

You need to uncomment these in php.ini (take out the semi-colons) and turn the encoding translation on if you want it to work. Then the form input will be converted to EUC-JP before your script gets it and you won't have problems with addslashes().
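
Spelled out, the change would look something like the fragment below. It is
simply the quoted settings with the semi-colons removed and the translation
switched on; whether the http_output value suits a given site is a separate
decision.

```ini
mbstring.language = Japanese
mbstring.internal_encoding = EUC-JP
mbstring.http_input = auto
mbstring.http_output = SJIS
mbstring.encoding_translation = On
```

After restarting the web server, phpinfo() should report
mbstring.encoding_translation as On.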


But be careful if you already have a DB full of SJIS encoded Japanese. You'll probably want to convert it to EUC before you switch this on to avoid a big mess.






--- End Message ---
--- Begin Message ---
From an email. Reposting to the list for those who might have the same question later on :)

Moriyoshi Koizumi wrote:
>
> Oops, I should have said mbstring.encoding_translation=on actually :)

Ok. Turning that on.

>>In which case I am safe :) But then again anyone who would want to try
>>an SQL injection attack might try and send some SJIS ... better safe
>>than sorry :)
>
>
> It took some minutes to sort out what you're saying here.. By the word
> "clients" I meant browsers and there I was trying to mention a case that
> some browsers that have certain settings try to send GET queries in UTF-8
> while such queries are basically supposed to be encoded in the same
> encoding as that the page is written in.


Sorry if my intentions were not clear, but I am trying to protect myself from SQL
injection attacks by using addslashes() on user-provided information. I cannot assume
anything about the incoming data (not even the encoding), since anyone trying to hack
my machine with such a technique could pretty much send whatever they wanted using a
telnet session or what not ...

> Anyway, Shift_JIS is not a great choice for PHP scripting.

Tell me about it. I have the hardest time getting the people who actually make the
HTML pages to use EUC instead of SJIS. Of course they all use MS platforms to create
the HTML content, so they can't understand why SJIS causes me pain when I try to edit
it on a *NIX box or parse it in PHP ...

Thanks for the info!

Jc


--- End Message ---
--- Begin Message ---
Jean-Christian Imbeault <[EMAIL PROTECTED]> wrote:
> Sorry if my intentions were not clear but I am trying to protect myself 
> from SQL injection attacks by using addslashes() on user-provided 
> information. I cannot assume anything about the incoming data (not even 
> the encoding) since anyone trying to hack my machine by using such a 
> technique could pretty much send whatever they wanted using a telnet 
> session or what not ...

Sorry for my misleading words too... SQL injection attacks can be 
prevented with a self-made addslashes() even if you choose SJIS for the 
internal charset.

example:

<?php
mb_internal_encoding("Shift_JIS");
// Escape backslash, double quote, single quote and NUL, reading the input as Shift_JIS
$escaped = mb_ereg_replace("([\\\\\"'\0])", "\\\\1", $sjis_string);
?>

>  > Anyway, Shift_JIS is not a great choice for PHP scripting.
> 
> Tell me about it. I have the hardest time getting the people who 
> actually make the HTML page to use EUC instead of SJIS. Of course they 
> all use MS platforms to create the HTML content so they can't understand 
> why SJIS causes me pain when I try and edit it on a *NIX box or parse it 
> in PHP ...

The main reason is that several SJIS double-byte characters have a second
(trail) byte whose value is the same as the character code of "\"
(backslash = \x5c). Such double-byte characters are unfortunately
mistreated by PHP, since backslashes are also used for escape sequences in
string literals.

http://www.microsoft.com/globaldev/reference/dbcs/932.htm
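
To check whether a given Shift_JIS string is affected, the bytes can be walked
pair by pair. This is only a sketch: sjis_has_backslash_trail() is a
hypothetical helper, and it assumes the usual Shift_JIS/CP932 lead-byte ranges
0x81-0x9F and 0xE0-0xFC and well-formed input.

```php
<?php
// Hypothetical helper: does $bytes (Shift_JIS) contain a double-byte
// character whose trail byte is 0x5C, the ASCII backslash value?
function sjis_has_backslash_trail($bytes)
{
    $len = strlen($bytes);
    for ($i = 0; $i < $len; $i++) {
        $b = ord($bytes[$i]);
        $isLead = ($b >= 0x81 && $b <= 0x9F) || ($b >= 0xE0 && $b <= 0xFC);
        if ($isLead && $i + 1 < $len) {
            if (ord($bytes[$i + 1]) === 0x5C) {
                return true;   // trail byte collides with "\"
            }
            $i++;              // skip the trail byte of this character
        }
    }
    return false;              // single-byte backslashes are not reported
}

var_dump(sjis_has_backslash_trail("\x83\x5c"));  // "ソ"
var_dump(sjis_has_backslash_trail("\x82\xa0"));  // "あ"
```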

You can avoid this issue by building PHP with the --enable-zend-multibyte
option and setting mbstring.script_encoding to SJIS.

Also keep in mind that the same thing applies to
CP936(a GB2312 variant, used in the simplified Chinese version of Windows), 
CP949(a KSC5601 variant, used in the Korean version of Windows), and 
CP950(big5, used in the traditional Chinese version of Windows).

However, as of the current implementation, the character sets / encodings 
mentioned above are not supported by the zend multibyte stuff.

Hope this helps,

Moriyoshi


--- End Message ---
--- Begin Message ---
A . H . S . Boy wrote:

> Fulltext index searching, however, seems to fail miserably with the Japanese
> text...it just plain doesn't work. No results returned ever. Anyone know
> anything about that?

I don't know why MySQL cannot do it, but if it is important to you, PostgreSQL handles Japanese (and any other language) and can do full-text searches.


Jc


--- End Message ---
