Re: UTF-8 (Japanese) strings which look like ASCII strings

Glenn Maynard Fri, 16 Apr 2004 12:47:54 -0700

On Fri, Apr 16, 2004 at 02:19:50PM +0100, Richard Jones wrote:
> A user on one of the sites that I run has managed to create two
> user accounts for themselves:
> 
> Yoshi
> %EF%BC%B9%EF%BD%8F%EF%BD%93%EF%BD%88%EF%BD%89  (UTF-8 using URL encoding)

Which, for reference, is "Ｙｏｓｈｉ" (fullwidth, i.e. double-width, Latin letters).

> When rendered in a web browser they both appear as "Yoshi", but from
> the point of view of my code and the database they are, of course,
> different.  I allow people to have unrestricted usernames rather than
> restricting them to ASCII-printable-only characters because this makes
> sense on a Japanese site.

A simple fix for the above would be to convert the U+FFxx characters
(the Halfwidth and Fullwidth Forms block) to their ordinary equivalents;
that covers all of those double-width roman characters, along with
half-width katakana and punctuation, which could present a similar problem.

  http://www.unicode.org/charts/PDF/UFF00.pdf
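
A minimal sketch of that conversion in Python, covering only the
fullwidth ASCII variants (U+FF01 through U+FF5E), which sit at a fixed
offset of 0xFEE0 from U+0021 through U+007E.  The function name
fold_fullwidth is made up for illustration, and half-width katakana is
deliberately left alone here, so it would need its own mapping:

```python
from urllib.parse import unquote

# Fullwidth ASCII variants U+FF01..U+FF5E map onto U+0021..U+007E
# by subtracting a constant offset of 0xFEE0.
FULLWIDTH_TO_ASCII = {cp: cp - 0xFEE0 for cp in range(0xFF01, 0xFF5F)}


def fold_fullwidth(name: str) -> str:
    """Replace fullwidth ASCII variants with plain ASCII (hypothetical helper)."""
    return name.translate(FULLWIDTH_TO_ASCII)


# The URL-encoded username from the original report; unquote() decodes
# percent-escapes as UTF-8 by default.
raw = unquote("%EF%BC%B9%EF%BD%8F%EF%BD%93%EF%BD%88%EF%BD%89")
folded = fold_fullwidth(raw)
```

With this, the second account name folds to the same string as the plain
ASCII one, so a uniqueness check on the folded form would catch it.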

Another approach would be to normalize to NFKC or NFKD (the Unicode
compatibility normalization forms) before comparing.  This might do more
than you want, however; see

  http://www.unicode.org/charts/normalization/chart_Katakana.html
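
A rough sketch of the NFKC route, using Python's standard unicodedata
module.  The name canonical_username is an assumption for illustration;
whether to store the normalized form or only compare on it is a separate
decision.  The last two samples show the "more than you want" side:
half-width katakana becomes full-width, and a single square symbol
expands into two katakana.

```python
import unicodedata


def canonical_username(name: str) -> str:
    """Return the NFKC form of a username (hypothetical helper)."""
    return unicodedata.normalize("NFKC", name)


fullwidth_yoshi = "\uFF39\uFF4F\uFF53\uFF48\uFF49"  # fullwidth "Yoshi"
halfwidth_ka = "\uFF76"                             # half-width katakana KA
square_kiro = "\u3314"                              # SQUARE KIRO, one code point

same_as_ascii = canonical_username(fullwidth_yoshi) == "Yoshi"
widened_ka = canonical_username(halfwidth_ka)       # U+30AB, full-width KA
expanded_kiro = canonical_username(square_kiro)     # U+30AD U+30ED
```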

-- 
Glenn Maynard

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
