On 17.03.2016 01:29, Guido van Rossum wrote:
> I've updated the PEP. Please review. I decided not to update the
> Unicode howto (the thing is too obscure). Serhiy, you're probably in a
> better position to fix the code looking for cookies to pick the first
> one if there are two on the same line (or do whatever you think should
> be done there).

Thanks, will do.

> Should we recommend that everyone use tokenize.detect_encoding()?

I'd prefer a separate utility for this somewhere, since
tokenize.detect_encoding() is not available in Python 2.

I've attached an example implementation with tests, which works
in Python 2.7 and 3.

> On Wed, Mar 16, 2016 at 5:05 PM, Guido van Rossum <gu...@python.org> wrote:
>> On Wed, Mar 16, 2016 at 12:59 AM, M.-A. Lemburg <m...@egenix.com> wrote:
>>> The only reason to read up to two lines was to address the use of
>>> the shebang on Unix, not to be able to define two competing
>>> source code encodings :-)
>>
>> I know. I was just surprised that the PEP was sufficiently vague about
>> it that when I found that mypy picked the second if there were two, I
>> couldn't prove to myself that it was violating the PEP. I'd rather
>> clarify the PEP than rely on the reasoning presented earlier here.

I suppose it's a rather rare case, since it's the first time
that I heard about anyone thinking that a possible second line
could be picked - after 15 years :-)

>> I don't like erroring out when there are two different cookies on two
>> lines; I feel that the spirit of the PEP is to read up to two lines
>> until a cookie is found, whichever comes first.
>>
>> I will update the regex in the PEP too (or change the wording to avoid
>> "match").
>>
>> I'm not sure what to do if there are two cookies on one line. If
>> CPython currently picks the latter we may want to preserve that
>> behavior.
>>
>> Should we recommend that everyone use tokenize.detect_encoding()?
>>
>> --
>> --Guido van Rossum (python.org/~guido)

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Experts (#1, Mar 17 2016)
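
For comparison, here is roughly what the tokenize.detect_encoding() route from the question above looks like on Python 3. This is only a minimal sketch: the wrapper name stdlib_detect is made up for illustration and is not part of any library.

```python
import codecs
import io
import tokenize


def stdlib_detect(source_bytes):
    """Hypothetical wrapper: return the encoding name reported by
    tokenize.detect_encoding() for the given source bytes (Python 3 only)."""
    # detect_encoding() expects a readline callable that yields bytes and
    # returns (encoding, lines_read); only the encoding is used here.
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding


if __name__ == '__main__':
    samples = (
        b"# coding: latin-1\n# coding: utf-8\n",
        b"#!/usr/bin/python\n# coding: utf-8\n",
        b"# No encoding on first line\n# None on second\n# coding: latin-1\n",
        codecs.BOM_UTF8 + b"# No encoding\n",
    )
    for sample in samples:
        print(repr(sample), '->', stdlib_detect(sample))
```

Note that the stdlib function does not report the same names as a plain cookie lookup would: it normalizes cookie names (latin-1 comes back as iso-8859-1), falls back to utf-8 rather than ascii when no cookie is present, and reports utf-8-sig when a UTF-8 BOM is found. It also raises SyntaxError when a BOM is combined with a non-UTF-8 cookie, so callers may want to catch that.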
#!/usr/bin/python

"""

    Utility to detect the source code encoding of a Python file.

    Marc-Andre Lemburg, 2016.

    Supports Python 2.7 and 3.

"""
import sys
import re
import codecs

# Debug output ?
_debug = True

# PEP 263 RE
PEP263 = re.compile(b'^[ \t]*#.*?coding[:=][ \t]*([-.a-zA-Z0-9]+)',
                    re.MULTILINE)

###

def detect_source_encoding(code, buffer_size=400):

    """ Detect and return the source code encoding of the Python code
        given in code. code must be given as bytes.

        The function only looks at the first buffer_size bytes of code
        (default: 400) to determine the first two code lines. This can
        be adjusted using the buffer_size parameter.

    """
    # Get the first two lines
    first_two_lines = b'\n'.join(code[:buffer_size].splitlines()[:2])
    # A UTF-8 BOM overrides any source code encoding comment. Note that
    # codecs.BOM is the native-endian UTF-16 BOM, so codecs.BOM_UTF8 is
    # the constant to test for here.
    if first_two_lines.startswith(codecs.BOM_UTF8):
        return 'utf-8'
    # .search() picks the first occurrence
    m = PEP263.search(first_two_lines)
    if m is None:
        return 'ascii'
    return m.group(1).decode('ascii')

# Tests

def _test():

    l = (
        (b"""\
# No encoding
""", 'ascii'),
        (b"""\
# coding: latin-1
""", 'latin-1'),
        (b"""\
#!/usr/bin/python
# coding: utf-8
""", 'utf-8'),
        (b"""\
coding=123
# The above could be detected as source code encoding
""", 'ascii'),
        (b"""\
# coding: latin-1
# coding: utf-8
""", 'latin-1'),
        (b"""\
# No encoding on first line
# No encoding on second line
# coding: utf-8
""", 'ascii'),
        (codecs.BOM_UTF8 + b"""\
# No encoding
""", 'utf-8'),
        (codecs.BOM_UTF8 + b"""\
# BOM and encoding
# coding: latin-1
""", 'utf-8'),
        )
    for code, encoding in l:
        if _debug:
            print ('=' * 72)
            print ('Checking:')
            print ('-' * 72)
            print (code.decode('latin-1'))
            print ('-' * 72)
        detected_encoding = detect_source_encoding(code)
        if _debug:
            print ('detected: %s, expected: %s' %
                   (detected_encoding, encoding))
        assert detected_encoding == encoding

if __name__ == '__main__':
    _test()
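
On the open question of two cookies on one line: which one a regex picks depends on whether the ".*" before "coding" is greedy. The sketch below compares the non-greedy pattern used in the utility above with a greedy variant. The names NON_GREEDY, GREEDY and cookie() are made up for this illustration, and it says nothing about what CPython itself does.

```python
import re

# Same pattern as PEP263 in the utility: ".*?" stops at the first "coding".
NON_GREEDY = re.compile(br'^[ \t]*#.*?coding[:=][ \t]*([-.a-zA-Z0-9]+)',
                        re.MULTILINE)
# Hypothetical greedy variant: ".*" backtracks from the end of the line,
# so it ends up at the last "coding" on that line.
GREEDY = re.compile(br'^[ \t]*#.*coding[:=][ \t]*([-.a-zA-Z0-9]+)',
                    re.MULTILINE)


def cookie(pattern, data):
    """Return the encoding name captured by pattern in data, or None."""
    m = pattern.search(data)
    if m is None:
        return None
    return m.group(1).decode('ascii')


line = b"# coding: latin-1 coding: utf-8\n"
first = cookie(NON_GREEDY, line)   # cookie closest to the '#'
last = cookie(GREEDY, line)        # cookie closest to the end of the line

if __name__ == '__main__':
    print('non-greedy: %s, greedy: %s' % (first, last))
```

With the non-greedy form the first cookie on the line wins (latin-1 here), while the greedy form yields the last one (utf-8), so the choice of quantifier is what decides the one-line case.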
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev