On 29/10/2013 17:15, Jörg Knappen wrote:
After running this script, a few more things were there:
Non-normalised accents and some really strange
encodings I could not really explain but rather guess their meanings, like
s/Ãœ/Ü/g
s/É/É/g
s/AÌ€/À/g
s/aÌ€/à/g
s/EÌ€/È/g
s/eÌ€/è/g
s/„/„/g
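For readers wondering where sequences like these come from, here is a minimal Python 3 sketch. It assumes the mangling was UTF-8 bytes misread as Windows-1252 and that the "non-normalised accents" were a base letter followed by U+0300 COMBINING GRAVE ACCENT. The helper names `mangle` and `repair` are hypothetical and are not part of the script discussed in this thread.

```python
import unicodedata


def mangle(text, codec="cp1252"):
    """Reproduce the damage: UTF-8 bytes misread in a single-byte codec."""
    return text.encode("utf-8").decode(codec)


def repair(text, codec="cp1252"):
    """Undo the misreading, then compose base letter + combining accent (NFC)."""
    return unicodedata.normalize("NFC", text.encode(codec).decode("utf-8"))


# "A" followed by U+0300 is the decomposed form of "À".
print(mangle("Ü"), mangle("A\u0300"))
print(repair("Ãœ"), repair("AÌ€"), repair("eÌ€"))
```

Under these assumptions the round trip reproduces several of the left-hand sides above, which may explain why they looked so strange: the combining accent alone turns into two visible characters.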
Thanks again!
My updated sed pattern generator now looks like:
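# Python 2: map each character's UTF-8 bytes, misread as latin-1, back to the character
r = range(0xa0, 0x170)
file = open("fixu8.sed", "w")
for i in r:
    pat1 = "s/" + unichr(i).encode("utf-8").decode("latin-1").encode("utf-8") + "/" + unichr(i).encode("utf-8") + "/g"
    print >>file, pat1
    try:
        pat2 =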
On 30/10/2013 16:13, Jörg Knappen wrote:
Thanks again!
My updated sed pattern generator now looks like:
r = range(0xa0, 0x170)
file = open("fixu8.sed", "w")
for i in r:
    pat1 = "s/" + unichr(i).encode("utf-8").decode("latin-1").encode("utf-8") + "/" + unichr(i).encode("utf-8") + "/g"
    print >>file, pat1
    try:
The data did not only contain latin-1 type mangling for the non-existent Windows characters, but also sequences with the raw
C1 control characters for all of latin-1. So I had to do them, too.
The data weren't consistent at all, not even in their errors.
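A rough Python 3 sketch of a generator that covers both kinds of damage described here: UTF-8 misread as latin-1 (which yields raw C1 control characters) and UTF-8 misread as Windows-1252 (which has no character at a few byte values). This is not the script from the thread, whose full text is cut off above; the function name `mojibake_patterns` and the choice of Python's `cp1252` codec are assumptions.

```python
def mojibake_patterns(start=0xA0, stop=0x170):
    """Yield sed substitutions that map mis-decoded UTF-8 back to the character."""
    for cp in range(start, stop):
        ch = chr(cp)
        raw = ch.encode("utf-8")
        seen = set()
        for codec in ("latin-1", "cp1252"):
            try:
                mangled = raw.decode(codec)
            except UnicodeDecodeError:
                # cp1252 leaves a few bytes (e.g. 0x81, 0x8D) undefined.
                continue
            if mangled not in seen:
                seen.add(mangled)
                yield "s/{}/{}/g".format(mangled, ch)


if __name__ == "__main__":
    # Write the patterns as UTF-8 so sed sees the same bytes as the data.
    with open("fixu8.sed", "w", encoding="utf-8") as out:
        for pattern in mojibake_patterns():
            out.write(pattern + "\n")
```

For a character such as Ü the two codecs disagree on the second byte, so the sketch emits two patterns; where they agree, only one is emitted.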
--Jörg Knappen
Sent: Wednesday,
On 30/10/2013 17:32, Jörg Knappen wrote:
The data did not only contain latin-1 type mangling for the
non-existent Windows characters, but also sequences with the raw
C1 control characters for all of latin-1. So I had to do them, too.
The data weren't consistent at all, not even in their
On Wed, Oct 30, 2013 at 9:56 AM, Frédéric Grosshans
frederic.grossh...@gmail.com wrote:
On 30/10/2013 17:32, Jörg Knappen wrote:
The data did not only contain latin-1 type mangling for the non-existent
Windows characters, but also sequences with the raw
C1 control characters for all of
Let me include the unicode alias as well for wider audience since this topic
came up few times in the past.
From: James Lin <james_...@symantec.com>
Date: Wednesday, October 30, 2013 at 1:11 PM
To: cldr-us...@unicode.org
EAI doesn't really specify anything more than the older SMTP about validating
email addresses. Everything in the local part >= U+0080 is permissible and up
to the server to sort out what characters it wants to allow, how it wants to
map things like Turkish I, etc. Some code points are clearly
Hi
I am not expecting a single regular expression to solve all possible
combination of scripts. What I am looking for probably (which may not be
possible due to combination of scripts and mix scripts) is somewhere along the
line of having individual scripts that validate by the regular
Mixed script stuff considerations are all supposed to be done by the mailbox
administrator. It's perfectly valid for a domain to assign Latin addresses and
also Cyrillic ones. Indeed for Cyrillic EAI, one probably would almost
certainly require ASCII (eg: Latin) aliases during whatever the
You should not attempt to detect scripts or even assume that they are
encoded based on Unicode, in the username part; all you can do is to
break at the first @ to split it between user name part and the domain
name, then use the IDN specs to validate the domain name part.
* 1. Domain name part:
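The split-then-validate approach described in this message might look roughly like the following Python 3 sketch. It assumes Python's built-in `idna` codec (which implements the older IDNA 2003 rules) is an acceptable stand-in for "the IDN specs"; the function name `split_address` is hypothetical, and the local part is deliberately left uninspected.

```python
def split_address(address):
    """Split at the first '@' and convert the domain to its ASCII (IDNA) form.

    The local part is returned untouched: the receiving server decides
    which characters it accepts there.
    """
    local, sep, domain = address.partition("@")
    if not sep or not local or not domain:
        raise ValueError("expected an address of the form local@domain")
    # Raises UnicodeError (a ValueError subclass) if a label cannot be encoded.
    ascii_domain = domain.encode("idna").decode("ascii")
    return local, ascii_domain


print(split_address("jörg@bücher.example"))
```

Stricter checks, such as length limits or a newer IDNA profile, would need to be layered on top of this.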
For EAI (the question being asked), the entire address, local part and domain,
are encoded in UTF-8.
-Shawn
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf
Of Philippe Verdy
Sent: Wednesday, October 30, 2013 4:08 PM
To: James Lin
Cc: Paweł Dyda;
2013/10/31 Shawn Steele shawn.ste...@microsoft.com
For EAI (the question being asked), the entire address, local part and
domain, are encoded in UTF-8.
No. The question being asked (by James Lin) did NOT include this
restriction:
does anyone has the best practice or guideline on how to