UTF-8 (Japanese) strings which look like ASCII strings

Richard Jones Fri, 16 Apr 2004 06:21:47 -0700

A user on one of the sites that I run has managed to create two
user accounts for themselves:


Yoshi
%EF%BC%B9%EF%BD%8F%EF%BD%93%EF%BD%88%EF%BD%89  (UTF-8 using URL encoding)

When rendered in a web browser they both appear as "Yoshi", but from
the point of view of my code and the database they are, of course,
different.  I allow people to have unrestricted usernames rather than
restricting them to ASCII-printable-only characters because this makes
sense on a Japanese site.

The problem, though, is that this user cannot log in to the non-ASCII
account.  Or at least they could if I explained at length what has
happened and they understood my explanation, but they shouldn't have
to do this just to use a web site.

Is there a way to solve this?  For example, is it feasible to work out
whether a general UTF-8 string has a lossless representation in ASCII
and, if so, do the conversion?  [Note that in the second string above,
it looks as if the Japanese part of Unicode contains a second mapping
of the Roman character set, so presumably this is not a
straightforward conversion.]
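
One possible approach, offered as a sketch rather than a definitive
answer: the bytes in the second username decode to fullwidth Latin
letters (U+FF39, U+FF4F, U+FF53, U+FF48, U+FF49), and Unicode
compatibility normalization (NFKC) maps those to their plain ASCII
counterparts.  The Python below assumes usernames arrive URL-encoded
as in the example; the function names canonical_username and
is_lossless_ascii are hypothetical and chosen only for illustration.
Note that NFKC also folds other compatibility characters (ligatures,
halfwidth katakana and so on), so it is "lossless" only in the sense
that visually equivalent forms collapse to one spelling.

```python
import unicodedata
from urllib.parse import unquote


def canonical_username(raw: str) -> str:
    """Decode %XX escapes as UTF-8, then apply NFKC normalization.

    NFKC replaces compatibility characters (such as fullwidth Latin
    letters) with their ordinary equivalents.
    """
    decoded = unquote(raw, encoding="utf-8")
    return unicodedata.normalize("NFKC", decoded)


def is_lossless_ascii(raw: str) -> bool:
    """Return True if the normalized username is pure ASCII."""
    return canonical_username(raw).isascii()


plain = "Yoshi"
fullwidth = "%EF%BC%B9%EF%BD%8F%EF%BD%93%EF%BD%88%EF%BD%89"

# Both spellings reduce to the same key after normalization.
print(canonical_username(plain) == canonical_username(fullwidth))
print(is_lossless_ascii(fullwidth))
```

Storing and comparing the normalized form at both registration and
login would make the two accounts above collide, while genuinely
Japanese usernames would pass through as non-ASCII text.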

Alternatively (and I don't really want to do this), is it possible to
have an HTML form which accepts UTF-8 in most fields, but with one
field limited to ASCII only?

Is it a good idea to allow unrestricted usernames in any case?

Rich.

-- 
Richard Jones. http://www.annexia.org/ http://www.j-london.com/
Merjis Ltd. http://www.merjis.com/ - improving website return on investment
MONOLITH is an advanced framework for writing web applications in C, easier
than using Perl & Java, much faster and smaller, reusable widget-based arch,
database-backed, discussion, chat, calendaring:
http://www.annexia.org/freeware/monolith/

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
