Re: Aw: Re: Re: Do you know a tool to decode UTF-8 twice

2013-10-30 Thread Frédéric Grosshans
On 29/10/2013 17:15, Jörg Knappen wrote: After running this script, a few more things were there: Non-normalised accents and some really strange encodings whose meanings I could not explain, only guess, like s/Ãœ/Ü/g s/É/É/g s/AÌ€/À/g s/aÌ€/à/g s/EÌ€/È/g s/eÌ€/è/g s/„/„/g

Aw: Re: Re: Re: Do you know a tool to decode UTF-8 twice

2013-10-30 Thread Jörg Knappen
Thanks again! My updated sed pattern generator now looks like: r = range(0xa0, 0x170) file = open("fixu8.sed", "w") for i in r: pat1 = "s/" + unichr(i).encode("utf-8").decode("latin-1").encode("utf-8") + "/" + unichr(i).encode("utf-8") + "/g" print >>file, pat1 try: pat2 =
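
The generator quoted above is Python 2 (unichr, print statement) and is cut off in this archive. A minimal Python 3 sketch of the same idea might look like the following. It covers only the latin-1 rule (pat1); the second rule (pat2) is truncated in the archive and is not reconstructed here. The function names sed_rule and sed_rules are hypothetical and do not appear in the thread.

```python
def sed_rule(code_point):
    """Build one sed substitution that undoes double UTF-8 encoding.

    The character's UTF-8 bytes are misread as latin-1, which yields the
    mangled text; the rule maps that mangled text back to the character.
    """
    char = chr(code_point)
    mangled = char.encode("utf-8").decode("latin-1")
    return "s/" + mangled + "/" + char + "/g"


def sed_rules(start=0xA0, stop=0x170):
    """Return one rule per code point in range(start, stop)."""
    return [sed_rule(cp) for cp in range(start, stop)]


if __name__ == "__main__":
    # Write the rules as UTF-8, so that sed sees the doubly encoded bytes.
    with open("fixu8.sed", "w", encoding="utf-8") as out:
        for rule in sed_rules():
            out.write(rule + "\n")
```

The resulting file could then be applied with something like sed -f fixu8.sed on the damaged text, assuming a sed that handles the input as UTF-8 or as raw bytes.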

Re: Aw: Re: Re: Re: Do you know a tool to decode UTF-8 twice

2013-10-30 Thread Frédéric Grosshans
On 30/10/2013 16:13, Jörg Knappen wrote: Thanks again! My updated sed pattern generator now looks like: r = range(0xa0, 0x170) file = open("fixu8.sed", "w") for i in r: pat1 = "s/" + unichr(i).encode("utf-8").decode("latin-1").encode("utf-8") + "/" + unichr(i).encode("utf-8") + "/g" print >>file, pat1 try:

Aw: Re: Re: Re: Re: Do you know a tool to decode UTF-8 twice

2013-10-30 Thread Jörg Knappen
The data did not only contain latin-1 type mangling for the non-existent Windows characters, but also sequences with the raw C1 control characters for all of latin-1. So I had to do them, too. The data weren't consistent at all, not even in their errors. --Jörg Knappen Sent: Wednesday,
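
The mix described here (some sequences mangled through Windows-1252, others carrying raw C1 control characters as in latin-1) can also be undone without sed. The following is only a sketch under that assumption, not the method used in the thread; the names _to_byte and demangle are hypothetical. Each character is mapped back to a single byte, trying cp1252 first and falling back to latin-1, and the resulting bytes are decoded as UTF-8.

```python
def _to_byte(ch):
    """Map one mangled character back to the byte it came from.

    cp1252 covers characters such as U+0153 (byte 0x9C); the latin-1
    fallback covers raw C1 controls such as U+009C (also byte 0x9C).
    """
    try:
        return ch.encode("cp1252")
    except UnicodeEncodeError:
        return ch.encode("latin-1")


def demangle(text):
    """Undo one round of double UTF-8 encoding, if the text allows it.

    Text that does not fit the pattern is returned unchanged.
    """
    try:
        raw = b"".join(_to_byte(ch) for ch in text)
        return raw.decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text
```

Because the function works on a whole string at once, it is best applied to small units (a field or a line): a single correctly encoded character in the same string makes the decode fail and leaves the whole string untouched.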

Re: Aw: Re: Re: Re: Re: Do you know a tool to decode UTF-8 twice

2013-10-30 Thread Frédéric Grosshans
On 30/10/2013 17:32, Jörg Knappen wrote: The data did not only contain latin-1 type mangling for the non-existent Windows characters, but also sequences with the raw C1 control characters for all of latin-1. So I had to do them, too. The data weren't consistent at all, not even in their

Re: Aw: Re: Re: Re: Re: Do you know a tool to decode UTF-8 twice

2013-10-30 Thread Buck Golemon
On Wed, Oct 30, 2013 at 9:56 AM, Frédéric Grosshans frederic.grossh...@gmail.com wrote: On 30/10/2013 17:32, Jörg Knappen wrote: The data did not only contain latin-1 type mangling for the non-existent Windows characters, but also sequences with the raw C1 control characters for all of

Re: Best practice of using regex on identify none-ASCII email address

2013-10-30 Thread James Lin
Let me include the Unicode alias as well for a wider audience, since this topic has come up a few times in the past. From: James Lin james_...@symantec.com Date: Wednesday, October 30, 2013 at 1:11 PM To: cldr-us...@unicode.org

RE: Best practice of using regex on identify none-ASCII email address

2013-10-30 Thread Shawn Steele
EAI doesn't really specify anything more than the older SMTP about validating email addresses. Everything in the local part >= U+0080 is permissible, and it is up to the server to sort out what characters it wants to allow, how it wants to map things like Turkish I, etc. Some code points are clearly

Re: Best practice of using regex on identify none-ASCII email address

2013-10-30 Thread James Lin
Hi, I am not expecting a single regular expression to handle all possible combinations of scripts. What I am probably looking for (which may not be possible, given combined and mixed scripts) is something along the lines of having individual scripts that validate by the regular

RE: Best practice of using regex on identify none-ASCII email address

2013-10-30 Thread Shawn Steele
Mixed-script considerations are all supposed to be handled by the mailbox administrator. It's perfectly valid for a domain to assign Latin addresses and also Cyrillic ones. Indeed, for Cyrillic EAI, one would almost certainly require ASCII (e.g. Latin) aliases during whatever the

Re: Best practice of using regex on identify none-ASCII email address

2013-10-30 Thread Philippe Verdy
You should not attempt to detect scripts, or even assume that they are encoded based on Unicode, in the user name part; all you can do is break at the first @ to split the address into the user name part and the domain name, then use the IDN specs to validate the domain name part. * 1. Domain name part:
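
The approach described in this message (split at the first @, leave the user name part alone, validate only the domain name) can be sketched as follows. This is an illustration, not a complete validator: split_address is a hypothetical name, and Python's built-in "idna" codec implements the older IDNA 2003 rules, which is an assumption about what is acceptable here.

```python
def split_address(address):
    """Split at the first '@' and convert the domain to its ASCII form.

    The local part is returned untouched, since only the receiving
    server can decide which characters it accepts there.
    Raises ValueError if either part is empty or the domain is rejected.
    """
    local, sep, domain = address.partition("@")
    if not sep or not local or not domain:
        raise ValueError("expected local@domain: %r" % address)
    try:
        # Built-in codec, IDNA 2003; returns ASCII bytes such as b"xn--..."
        ascii_domain = domain.encode("idna").decode("ascii")
    except UnicodeError as exc:
        raise ValueError("invalid domain name: %r" % domain) from exc
    return local, ascii_domain
```

For example, split_address("user@bücher.example") keeps "user" as is and converts the domain to "xn--bcher-kva.example". Nothing is checked in the local part beyond its being non-empty.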

RE: Best practice of using regex on identify none-ASCII email address

2013-10-30 Thread Shawn Steele
For EAI (the question being asked), the entire address, local part and domain, are encoded in UTF-8. -Shawn From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of Philippe Verdy Sent: Wednesday, October 30, 2013 4:08 PM To: James Lin Cc: Paweł Dyda;

Re: Best practice of using regex on identify none-ASCII email address

2013-10-30 Thread Philippe Verdy
2013/10/31 Shawn Steele shawn.ste...@microsoft.com For EAI (the question being asked), the entire address, local part and domain, are encoded in UTF-8. No. The question being asked (by James Lin) did NOT include this restriction: does anyone has the best practice or guideline on how to