On 17.03.2016 18:53, Serhiy Storchaka wrote:
> On 17.03.16 19:23, M.-A. Lemburg wrote:
>> On 17.03.2016 15:02, Serhiy Storchaka wrote:
>>> On 17.03.16 15:14, M.-A. Lemburg wrote:
>>>> On 17.03.2016 01:29, Guido van Rossum wrote:
>>>>> Should we recommend that everyone use tokenize.detect_encoding()?
>>>>
>>>> I'd prefer a separate utility for this somewhere, since
>>>> tokenize.detect_encoding() is not available in Python 2.
>>>>
>>>> I've attached an example implementation with tests, which works
>>>> in Python 2.7 and 3.
>>>
>>> Sorry, but this code doesn't match the behaviour of Python interpreter,
>>> nor other tools. I suggest to backport tokenize.detect_encoding() (but
>>> be aware that the default encoding in Python 2 is ASCII, not UTF-8).
>>
>> Yes, I got the default for Python 3 wrong. I'll fix that. Thanks
>> for the note.
>>
>> What other aspects are different than what Python implements ?
>
> 1. If there is a BOM and coding cookie, the source encoding is "utf-8-sig".
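A quick way to see that behaviour is Python 3's tokenize.detect_encoding(),
which takes a readline callable and returns the detected encoding plus the
lines it consumed (the sample source bytes below are made up for
illustration):

import io
import tokenize  # Python 3 only -- exactly what's missing on Python 2

# Source starting with a UTF-8 BOM *and* a matching coding cookie:
source = b'\xef\xbb\xbf# -*- coding: utf-8 -*-\nprint("hi")\n'

encoding, consumed = tokenize.detect_encoding(io.BytesIO(source).readline)
print(encoding)  # -> 'utf-8-sig'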
Ok, that makes sense (even though it's not mandated by the PEP;
the utf-8-sig codec didn't exist yet).

> 2. If there is a BOM and coding cookie is not 'utf-8', this is an error.

It's an error for Python, but why should a detection function always
raise an error for this case ? It would probably be a good idea to have
an errors parameter to leave this to the user to decide. Same for
unknown encodings.

> 3. If the first line is not blank or comment line, the coding cookie is
> not searched in the second line.

Hmm, the PEP does allow having the coding cookie in the second line,
even if the first line is not a comment. Perhaps that's not really needed.

> 4. Encoding name should be canonized. "UTF8", "utf8", "utf_8" and
> "utf-8" is the same encoding (and all are changed to "utf-8-sig" with BOM).

Well, that's cosmetics :-) The codec system will take care of this
when needed.

> 5. There isn't the limit of 400 bytes. Actually there is a bug with
> handling long lines in current code, but even with this bug the limit is
> larger.

I think it's a reasonable limit, since shebang lines may only be
127 characters long on at least Linux (and probably several other Unix
systems as well). But just in case, I made this configurable :-)

> 6. I made a mistake in the regular expression, missed the underscore.

I added it.

> tokenize.detect_encoding() is the closest imitation of the behavior of
> Python interpreter.

Probably, but that doesn't help us on Python 2, right ?

I'll upload the script to GitHub later today or tomorrow to continue
development.

-- 
Marc-Andre Lemburg
eGenix.com
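For reference, a short Python 3 sketch of the other behaviours discussed
above, again using tokenize.detect_encoding(); the sniff() helper and the
sample bytes are made up for illustration:

import io
import tokenize

def sniff(source_bytes):
    # Return the encoding tokenize.detect_encoding() reports for these bytes.
    return tokenize.detect_encoding(io.BytesIO(source_bytes).readline)[0]

# Point 4: underscores and case in the cookie are normalized by tokenize.
print(sniff(b'# -*- coding: utf_8 -*-\n'))    # -> 'utf-8'
print(sniff(b'# -*- coding: latin_1 -*-\n'))  # -> 'iso-8859-1'

# Point 2: a BOM combined with a non-utf-8 cookie is rejected outright.
try:
    sniff(b'\xef\xbb\xbf# -*- coding: latin-1 -*-\n')
except SyntaxError as exc:
    print('rejected:', exc)

# No BOM and no cookie: the Python 3 default of 'utf-8' is returned.
print(sniff(b'print("hi")\n'))                # -> 'utf-8'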