php-i18n Digest 2 Mar 2003 17:39:55 -0000 Issue 156
Topics (messages 471 through 486):
Re: detecting katakana vs hiragana
471 by: Moriyoshi Koizumi
473 by: Simon Dedeyne
474 by: Moriyoshi Koizumi
Re: Chasen Questions
472 by: Moriyoshi Koizumi
Allowing for time differences and language differences.
475 by: Ian A. Gray
476 by: Gary Ross
Re: Is it multi-byte safe?
477 by: Jean-Christian Imbeault
479 by: Moriyoshi Koizumi
480 by: Jean-Christian Imbeault
481 by: Jean-Christian Imbeault
482 by: Moriyoshi Koizumi
483 by: Jean-Christian Imbeault
484 by: David Emery
485 by: Jean-Christian Imbeault
486 by: Moriyoshi Koizumi
Re: Internationalized feeding of MySQL
478 by: Jean-Christian Imbeault
Administrivia:
To subscribe to the digest, e-mail:
[EMAIL PROTECTED]
To unsubscribe from the digest, e-mail:
[EMAIL PROTECTED]
To post to the list, e-mail:
[EMAIL PROTECTED]
----------------------------------------------------------------------
--- Begin Message ---
On Fri, 28 Feb 2003 12:28:18 +0100
"Simon Dedeyne" <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> Is there a way/function for detecting if a word is in katakana or
> hiragana?
> I'm using UTF-8 encoding.
>
> I don't know if it's of any help, but I wonder if a solution could be
> found through a regular expression option, as indicated (though not
> explained) here http://regex.info/indexlist.html
>
> Tnx,
> Simon
You can accomplish it by converting any hiragana / katakana mixture to either
hiragana or katakana characters with mb_convert_kana() and then comparing
the result with the original.
Also try the mb_ereg_* functions, which allow localised characters to be used
in character classes.
Moriyoshi
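The mb_ereg_* suggestion could look roughly like the following. This is only a minimal sketch: it assumes the script file itself is saved as UTF-8 and that mbstring was built with multibyte regex support. is_hiragana() and is_katakana() are hypothetical helper names, and the ranges cover only the basic kana (the katakana class also admits the prolonged sound mark).

```php
<?php
// Assumption: this source file is saved as UTF-8, so the kana literals
// in the patterns below are UTF-8 byte sequences.
mb_regex_encoding("UTF-8");

// Hypothetical helper: true if $s consists only of basic hiragana.
function is_hiragana($s)
{
    return (bool) mb_ereg('^[ぁ-ん]+$', $s);
}

// Hypothetical helper: true if $s consists only of basic katakana
// (including the prolonged sound mark).
function is_katakana($s)
{
    return (bool) mb_ereg('^[ァ-ヶー]+$', $s);
}

var_dump(is_hiragana("わたし")); // a hiragana-only word
var_dump(is_katakana("ワタシ")); // a katakana-only word
var_dump(is_hiragana("ワタシ")); // katakana tested against the hiragana class
```

A word containing kanji or Latin letters matches neither class, which may or may not be what you want.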
--- End Message ---
--- Begin Message ---
I haven't tried the mb_ereg solution yet, but I thought of doing something with
mb_convert_kana().
Here's a little example script. I tried it, but nothing gets converted at all! (I have
the mb functions enabled and am using PHP 4.3.0 on a Windows XP system.) Why doesn't
this work?
Simon
<html>
<head>
<meta http-equiv="Content-Type" content="Text/Html; Charset=UTF-8">
<title>hira2kata</title>
</head>
<body>
<?php
$str="わたし";
// watashi
echo $str."<br>";
$str = mb_convert_kana($str, "c");
echo $str."<br>";
$str = mb_convert_kana($str, "C");
echo $str;
?>
</body>
</html>
--- End Message ---
--- Begin Message ---
On Fri, 28 Feb 2003 14:43:38 +0100
"Simon Dedeyne" <[EMAIL PROTECTED]> wrote:
> I haven't tried the mb_ereg solution yet, but I thought of doing something with
> mb_convert_kana().
> Here's a little example script. I tried it, but nothing gets converted at all! (I have
> the mb functions enabled and am using PHP 4.3.0 on a Windows XP system.) Why doesn't
> this work?
Are you sure that you set mbstring.internal_encoding to the right encoding (UTF-8)?
Alternatively, pass the encoding explicitly as the third parameter:
mb_convert_kana($str, "C", "UTF-8").
Moriyoshi
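Putting the two suggestions together (convert, then compare with the original), a minimal sketch might look like this. kana_type() is a hypothetical helper name, and the input is assumed to be UTF-8.

```php
<?php
// Hypothetical helper: classify the kana in a UTF-8 string by comparing it
// with its all-hiragana and all-katakana conversions.
function kana_type($s)
{
    // Passing "UTF-8" explicitly avoids depending on mbstring.internal_encoding.
    $asHiragana = mb_convert_kana($s, "c", "UTF-8"); // full-width katakana -> hiragana
    $asKatakana = mb_convert_kana($s, "C", "UTF-8"); // full-width hiragana -> katakana

    if ($s === $asHiragana && $s === $asKatakana) {
        return "none";      // nothing was converted: no hiragana or katakana
    }
    if ($s === $asHiragana) {
        return "hiragana";  // no katakana was present
    }
    if ($s === $asKatakana) {
        return "katakana";  // no hiragana was present
    }
    return "mixed";
}

echo kana_type("わたし"), "\n";
echo kana_type("ワタシ"), "\n";
echo kana_type("わたシ"), "\n";
```

Non-kana characters are ignored by this check, so a word made of kanji plus hiragana is still reported as "hiragana".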
--- End Message ---
--- Begin Message ---
Hi,
On Thu, 27 Feb 2003 00:18:36 +0900
Gary Ross <[EMAIL PROTECTED]> wrote:
> I have chasen working fine as of 2.3.0 (with darts 0.1 installed)
>
> 1.
> It works fine with the
> SUFDIC and PATDIC
> but I get this error if I try to use DADIC
>
> I couldn't find any information regarding how to actually install
> chadic.lex
That seems to be because of some errors in the package.
> 2. Which is better to use SUFDIC or PATDIC?
> Speed is not really an issue as I'm on an extremely fast dedicated
> server.
> I don't really understand the differences.
It depends on how long the text you are parsing is.
According to the documentation:
SUFDIC => fast on startup, but slow on word lookup
PATDIC => slow on startup, but fast on word lookup
That is, when you parse a bunch of short texts over multiple sessions,
SUFDIC is the better choice. When you parse long texts
in a single session, PATDIC makes more sense.
> 3. How do I add words to the dictionary in a *nix environment?
> The FAQ on the home page has this question, then a blank space!
Anyway, this may be the wrong place for those questions because they have
nothing at all to do with PHP itself. As it's unlikely that you'll get
many responses from Chasen experts here, I think it'd be better if
you contacted the author directly.
Moriyoshi
--- End Message ---
--- Begin Message ---
Hi. I am quite new to PHP. I have a website which is hosted (using a hosting
company) on Linux servers and uses PHP 4.1.2 (wish they would upgrade it!)
I have a few questions:
1) I want to have a PHP script which I can call with the include() command from a
few of my web pages and which will greet you with "Good morning", "Good afternoon" or
"Good evening". I know there are many examples of these to be found. But don't these only
work if the person viewing the pages lives in the same time zone as my server? Is
there a way of displaying the correct greeting for the viewer's particular time zone?
2) I would like to show the correct date (again, so that it is the correct date
for the viewer's time zone) and also output it in the correct way for either US or UK
viewers. For example, a person looking at my site in the UK might see "28th
February 2003" and in the US they would see "February 28, 2003" (by the way, is that the
correct way of writing the date for the US?)
3) If it is indeed possible for PHP to differentiate between someone from the US, and
the UK (or indeed other countries) can it then output different spellings? For
example if part of my website had some text:
"A man with a grey-coloured top walked along the pavement and then onto the road. He
saw a car in the distance with a purple boot and a blue bonnet."
would it be possible for PHP to convert it for an American viewer:
"A man with a gray-colored top walked along the sidewalk and then onto the pavement.
He saw an automobile in the distance with a purple trunk and a blue hood."
Excuse me if I translated it into American English wrongly!
I would be grateful if people could come up with some solutions for these things.
Regards,
Ian Gray
---------------------------------
Ian A. Gray
Manchester, UK
Telephone: +44 (0) 161 224 1635 - Fax: +44 (0) 870 135 0061 - Mobile: +44 (0) 7900 996
328
US Fax no.: 707-885-3582
E-mail: [EMAIL PROTECTED] - Websites: www.baritone.uk.com (performance) &
www.vocalstudio.co.uk (Tuition)
---------------------------------
--- End Message ---
--- Begin Message ---
Hello,
> 1) I want to have a PHP script which I can then call by the include()
> command for a few of my web pages that will greet you with Good
> morning, good afternoon and good evening. I know there are many
> examples of these to be found. But don't these only work if the
> person viewing the pages lives in the same time-zone as my server? Is
> there a way of displaying the correct greeting for the viewer's
> particular time-zone?
First you'd have to work out the client's time zone from their
IP address. This is not as easy as it seems, but I think there are
scripts that would do the trick. Try hotscripts.com or something
similar. The built-in time(), date() and gmdate() functions can then help
you compute the viewer's actual time from their zone.
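Once the viewer's UTC offset is known, the greeting itself is simple. Here is a minimal sketch, assuming the offset in hours has already been obtained somehow (for example from an IP lookup or a value the visitor picked). local_hour() and greeting_for_hour() are hypothetical helpers, and the cut-off hours are arbitrary choices.

```php
<?php
// Hypothetical helper: hour of day (0-23) at a given UTC offset.
// gmdate() formats a timestamp as UTC, so shifting the timestamp by the
// offset yields the viewer's wall-clock hour regardless of the server's zone.
function local_hour($timestamp, $offsetHours)
{
    return (int) gmdate('G', $timestamp + (int) round($offsetHours * 3600));
}

// Hypothetical helper: pick a greeting for an hour of day (0-23).
function greeting_for_hour($hour)
{
    if ($hour < 12) {
        return "Good morning";
    }
    if ($hour < 18) {
        return "Good afternoon";
    }
    return "Good evening";
}

$offsetHours = 9; // assumption: the viewer is at UTC+9
echo greeting_for_hour(local_hour(time(), $offsetHours)), "\n";
```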
> 2) I would like to show the correct date (again so that it will be the
> correct date for the viewer's time-zone) but will also output in the
> correct way for either US or UK viewers. For example if a person
> looks at my site in the UK they may see: 28th February 2003 and in the
> US they would see February 28, 2003 (by the way, is that the correct
> way of putting the date for the US?)
> 3) If it is indeed possible for PHP to differentiate between someone
> from the US, and the UK (or indeed other countries) can it then output
> different spellings? For example if part of my website had some text:
> "A man with a grey-coloured top walked along the pavement and then
> onto the road. He saw a car in the distance with a purple boot and a
> blue bonnet."
> would it be possible for PHP to convert it for an American viewer:
> "A man with a gray-colored top walked along the sidewalk and then onto
> the pavement. He saw an automobile in the distance with a purple
> trunk and a blue hood."
I think what you ask is essentially impossible or would require some
high level 'translation' software to do the trick. This is because of
the contextual difficulties of word conversion that we can do without
thinking but computers are notoriously bad at.
If you just try changing 'boot' to 'trunk' then many times the results
aren't what you expected:
I put the suitcase in the boot. --> I put the suitcase in the trunk.
I was given the boot (by my company)--> I was given the trunk ???
I bought a new pair of boots --> I bought a new pair of trunks
(different meaning completely)
You may do better with the basic spelling issues (changing colour -->
color etc.) using str_replace():
$text = str_replace('colour', 'color', $text);
Make a list, put it in an array and loop around the str_replace()
function.
Gary
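The list-in-an-array idea can be sketched as below. The word list and the to_us_spelling() helper are purely illustrative. Since str_replace() accepts arrays for both the search and replacement arguments, an explicit loop is not strictly needed.

```php
<?php
// Illustrative UK => US spelling map; the entries are examples, not a complete list.
$ukToUs = array(
    'colour'    => 'color',
    'grey'      => 'gray',
    'favourite' => 'favorite',
);

// Hypothetical helper. str_replace() accepts arrays of search and
// replacement strings, so one call handles the whole list.
function to_us_spelling($text, $map)
{
    return str_replace(array_keys($map), array_values($map), $text);
}

echo to_us_spelling("A man with a grey-coloured top", $ukToUs), "\n";
```

Note that this is a plain, case-sensitive substring replacement, so it also rewrites words that merely contain an entry ("greyhound" would become "grayhound").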
--- End Message ---
--- Begin Message ---
Is addslashes() multi-byte safe?
I will be using it to escape multi-byte input and wouldn't want it to
mangle anything...
Thanks,
Jc
--- End Message ---
--- Begin Message ---
Jean-Christian Imbeault <[EMAIL PROTECTED]> wrote:
> Is addslashes() multi-byte safe?
>
> I will be using it to escape multi-byte input and wouldn't want it to
> mangle anything...
Partially yes.
Strings encoded in GB2312(CP936), big5, Shift_JIS are known to be
clobbered by addslashes().
UTF-8, EUC-JP, EUC-KR, EUC-CN and EUC-TW are not affected.
Moriyoshi
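To see the clobbering concretely, here is a small sketch using the kanji 表, whose Shift_JIS encoding ends in the byte 0x5C. The strings are written as byte escapes so the example does not depend on how the script file is saved.

```php
<?php
// The kanji 表 encoded in two ways.
$sjis = "\x95\x5c";     // Shift_JIS: the second byte is 0x5C, the ASCII backslash
$utf8 = "\xe8\xa1\xa8"; // UTF-8: no byte falls in the ASCII range

// addslashes() works byte by byte, so it "escapes" the 0x5C trail byte
// of the Shift_JIS character but leaves the UTF-8 bytes alone.
echo bin2hex(addslashes($sjis)), "\n";
echo bin2hex(addslashes($utf8)), "\n";
```

The extra byte inserted into the Shift_JIS string changes the characters that follow when the result is decoded as Shift_JIS again.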
--- End Message ---
--- Begin Message ---
Moriyoshi Koizumi wrote:
> Partially yes.
> Strings encoded in GB2312(CP936), big5, Shift_JIS are known to be
> clobbered by addslashes().
Sh*t ... and I just added a whole bunch of addslashes() calls to my code to
prevent SQL attacks. And of course my web pages are for Japanese ... and
most of them will be using SJIS.
If I have internal_encoding set to EUC-JP does that mean that all POST
or GET vars passed in will be translated to EUC-JP and hence my
addslashes() calls will be fine?
I sure hope so ...
Jc
--- End Message ---
--- Begin Message ---
Jean-Christian Imbeault <[EMAIL PROTECTED]> wrote:
> Moriyoshi Koizumi wrote:
> >
> > Partially yes.
> >
> > Strings encoded in GB2312(CP936), big5, Shift_JIS are known to be
> > clobbered by addslashes().
>
> Sh*t ... and I just added a whole bunch of addslashes() to my code to
> prevent SQL attacks. And of course my web pages are for Japanese ... and
> most of them will be using SJIS.
>
> If I have internal_encoding set to EUC-JP does that mean that all POST
> or GET vars passed in will be translated to EUC-JP and hence my
> addslashes() calls will be fine?
That's the case as long as the browser really sends form contents as
EUC-JP encoded strings and no automagical encoding conversion is performed
there by the mbstring module (I mean output_handler=mb_output_handler in the ini
settings). You then also have to serve the page contents encoded in
EUC-JP.
But it's quite likely that clients send form contents in UTF-8 when the GET
method is used.
Moriyoshi
--- End Message ---
--- Begin Message ---
Moriyoshi Koizumi wrote:
> That's the case as long as the browser precisely sends form contents as
> EUC-JP encoded strings and no automagical encoding conversion is performed
> there by mbstring module (I mean output_handler=mb_output_handler in ini
> settings). Then you have to prepare the page contents to be encoded in
> EUC-JP.
Ok, no output_handler=mb_output_handler in my php.ini :) I am using the
recommended php.ini (not the default) and it has these settings by default:
;mbstring.language = Japanese
;mbstring.internal_encoding = EUC-JP
;mbstring.http_input = auto
;mbstring.http_output = SJIS
;mbstring.encoding_translation = Off
I see that mbstring.encoding_translation = Off ... and phpinfo() shows
that I have compiled with:
--enable-mbstring-enc-trans
But it also says:
                                Local   Global
mbstring.encoding_translation   Off     Off
I had assumed that my compile-time option would override the php.ini
setting, but I guess I was wrong?
If I want all incoming data to be translated to internal_encoding, I
guess I need to change this php.ini setting?
> But it's very probable that clients send form contents in UTF-8 when GET
> method is used..
In which case I am safe :) But then again anyone who would want to try
an SQL injection attack might try and send some SJIS ... better safe
than sorry :)
Jc
--- End Message ---
--- Begin Message ---
On 2003.Mar.1, at 21:14 Asia/Tokyo, Jean-Christian Imbeault wrote:
> Moriyoshi Koizumi wrote:
> > That's the case as long as the browser precisely sends form contents
> > as EUC-JP encoded strings and no automagical encoding conversion is
> > performed there by mbstring module (I mean
> > output_handler=mb_output_handler in ini settings). Then you have to
> > prepare the page contents to be encoded in EUC-JP.
> Ok, no output_handler=mb_output_handler in my php.ini :) I am using
> the recommended php.ini (not the default) and it has these settings by
> default:
> ;mbstring.language = Japanese
> ;mbstring.internal_encoding = EUC-JP
> ;mbstring.http_input = auto
> ;mbstring.http_output = SJIS
> ;mbstring.encoding_translation = Off
You need to uncomment these in php.ini (take out the semicolons) and
turn the encoding translation on if you want it to work. Then the form
input will be converted to EUC before your script gets it and you won't
have problems with addslashes().
But be careful if you already have a DB full of SJIS encoded Japanese.
You'll probably want to convert it to EUC before you switch this on to
avoid a big mess.
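The effect of converting to EUC-JP before escaping can also be seen with an explicit conversion in the script. This is only a sketch of the idea, not a replacement for the php.ini settings above: escape_sjis_input() is a hypothetical helper, and the input is assumed to be valid Shift_JIS.

```php
<?php
// Hypothetical helper: convert Shift_JIS input to EUC-JP, then escape it.
// In EUC-JP every byte of a multi-byte character has the high bit set,
// so addslashes() cannot mistake part of a character for a backslash or quote.
function escape_sjis_input($sjis)
{
    $euc = mb_convert_encoding($sjis, "EUC-JP", "SJIS");
    return addslashes($euc);
}

// 表 in Shift_JIS (0x95 0x5C) followed by a single quote.
$input = "\x95\x5c'";
echo bin2hex(escape_sjis_input($input)), "\n";
```

Only the trailing quote gets a backslash; the kanji itself comes through as two EUC-JP bytes.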
> I see that mbstring.encoding_translation = Off ... and phpinfo()
> shows that I have compiled with:
> --enable-mbstring-enc-trans
> But it also says:
>                                 Local   Global
> mbstring.encoding_translation   Off     Off
> I had assumed that my compile-time option would override the php.ini
> setting, but I guess I was wrong?
> If I want all incoming data to be translated to internal_encoding, I
> guess I need to change this php.ini setting?
> > But it's very probable that clients send form contents in UTF-8 when
> > GET method is used..
> In which case I am safe :) But then again anyone who would want to try
> an SQL injection attack might try and send some SJIS ... better safe
> than sorry :)
> Jc
--
PHP Internationalization Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
--- End Message ---
--- Begin Message ---
From an email. Reposting to the list for those who might have the same
question later on :)
Moriyoshi Koizumi wrote:
>
> Oops, I should have said mbstring.encoding_translation=on actually :)
Ok. Turning that on.
>>In which case I am safe :) But then again anyone who would want to try
>>an SQL injection attack might try and send some SJIS ... better safe
>>than sorry :)
>
>
> It took some minutes to sort out what you're saying here.. By the word
> "clients" I meant browsers, and I was trying to mention the case where
> browsers with certain settings send GET queries in UTF-8,
> while such queries are basically supposed to be encoded in the same
> encoding as the page is written in.
Sorry if my intentions were not clear, but I am trying to protect myself
from SQL injection attacks by applying addslashes() to user-provided
information. I cannot assume anything about the incoming data (not even
the encoding), since anyone trying to hack my machine with such a
technique could send pretty much whatever they wanted using a telnet
session or whatnot ...
> Anyway, Shift_JIS is not a great choice for PHP scripting.
Tell me about it. I have the hardest time getting the people who
actually make the HTML pages to use EUC instead of SJIS. Of course they
all use MS platforms to create the HTML content, so they can't understand
why SJIS causes me pain when I try to edit it on a *NIX box or parse it
in PHP ...
Thanks for the info!
Jc
--- End Message ---
--- Begin Message ---
Jean-Christian Imbeault <[EMAIL PROTECTED]> wrote:
> Sorry if my intentions were not clear, but I am trying to protect myself
> from SQL injection attacks by applying addslashes() to user-provided
> information. I cannot assume anything about the incoming data (not even
> the encoding), since anyone trying to hack my machine with such a
> technique could send pretty much whatever they wanted using a telnet
> session or whatnot ...
Sorry for my misleading words too... SQL injection attacks can be
prevented with a self-made addslashes() even if you choose SJIS for the
internal charset.
example:
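<?php
mb_internal_encoding("Shift_JIS");
mb_regex_encoding("SJIS"); // the mb_ereg_* functions use the regex encoding
// Escape backslash, double quote, single quote and NUL, as addslashes() does.
// The character class needs "\\\\" so that the backslash itself is included.
$escaped = mb_ereg_replace("([\\\\\"'\\x00])", "\\\\1", $sjis_string);
?>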
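A short sketch of how such an encoding-aware escape behaves next to the plain addslashes(), again using 表 (0x95 0x5C in Shift_JIS). sjis_addslashes() is a hypothetical name for the self-made function, and the character class here explicitly includes the backslash.

```php
<?php
mb_regex_encoding("SJIS"); // make mb_ereg_* treat 0x95 0x5C as one character

// Hypothetical helper: prefix backslash, quotes and NUL with a backslash,
// matching whole Shift_JIS characters rather than single bytes.
function sjis_addslashes($s)
{
    return mb_ereg_replace("([\\\\\"'\\x00])", "\\\\1", $s);
}

// 表 in Shift_JIS followed by a single quote.
$input = "\x95\x5c'";
echo bin2hex(sjis_addslashes($input)), "\n"; // encoding-aware escape
echo bin2hex(addslashes($input)), "\n";      // byte-wise escape, for comparison
```

The encoding-aware version escapes only the quote, while addslashes() also inserts a backslash after the kanji's trail byte.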
> > Anyway, Shift_JIS is not a great choice for PHP scripting.
>
> Tell me about it. I have the hardest time getting the people who
> actually make the HTML pages to use EUC instead of SJIS. Of course they
> all use MS platforms to create the HTML content, so they can't understand
> why SJIS causes me pain when I try to edit it on a *NIX box or parse it
> in PHP ...
The main reason is that some double-byte SJIS characters, each made up of a
lead byte and a second byte, have a second byte whose value is the same as
the character code of "\" (backslash = \x5c). PHP unfortunately mistreats
such characters, since backslashes are also used for escape sequences in
string literals.
http://www.microsoft.com/globaldev/reference/dbcs/932.htm
You can avoid this issue by configuring the PHP build
with the --enable-zend-multibyte option and setting mbstring.script_encoding to
SJIS.
Also keep in mind that the same thing applies to
CP936(a GB2312 variant, used in the simplified Chinese version of Windows),
CP949(a KSC5601 variant, used in the Korean version of Windows), and
CP950(big5, used in the traditional Chinese version of Windows).
However, as of the current implementation, the character sets / encodings
mentioned above are not supported by the zend multibyte stuff.
Hope this helps,
Moriyoshi
--- End Message ---
--- Begin Message ---
A . H . S . Boy wrote:
> Fulltext index searching, however, seems to fail miserably with the
> Japanese text...it just plain doesn't work. No results returned ever.
> Anyone know anything about that?
I don't know why MySQL cannot do it, but if it is important to you,
PostgreSQL handles Japanese (and any other language) and can do
full-text searches.
Jc
--- End Message ---