On 17.03.2016 01:29, Guido van Rossum wrote:
> I've updated the PEP. Please review. I decided not to update the
> Unicode howto (the thing is too obscure). Serhiy, you're probably in a
> better position to fix the code looking for cookies to pick the first
> one if there are two on the same line (or do whatever you think should
> be done there).

Thanks, will do.

> Should we recommend that everyone use tokenize.detect_encoding()?

I'd prefer a separate utility for this somewhere, since
tokenize.detect_encoding() is not available in Python 2.

I've attached an example implementation with tests, which works
in Python 2.7 and 3.
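
For comparison, here is a minimal sketch of what the Python 3 only
call looks like. The sample source bytes are made up for illustration;
tokenize.detect_encoding() takes a readline callable and returns the
encoding together with the lines it had to read:

```python
import io
import tokenize

# Hypothetical sample: shebang on line 1, encoding cookie on line 2.
source = b"#!/usr/bin/python\n# coding: latin-1\nprint('hi')\n"

# detect_encoding() expects a readline callable yielding bytes.
encoding, lines_read = tokenize.detect_encoding(io.BytesIO(source).readline)

print(encoding)    # normalized name: 'iso-8859-1'
print(lines_read)  # the (at most two) raw lines that were consumed
```

Note that it reports normalized names and defaults to 'utf-8' when no
cookie is found, so its results are not interchangeable with a
Python 2 style 'ascii' default.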

> On Wed, Mar 16, 2016 at 5:05 PM, Guido van Rossum <gu...@python.org> wrote:
>> On Wed, Mar 16, 2016 at 12:59 AM, M.-A. Lemburg <m...@egenix.com> wrote:
>>> The only reason to read up to two lines was to address the use of
>>> the shebang on Unix, not to be able to define two competing
>>> source code encodings :-)
>>
>> I know. I was just surprised that the PEP was sufficiently vague about
>> it that when I found that mypy picked the second if there were two, I
>> couldn't prove to myself that it was violating the PEP. I'd rather
>> clarify the PEP than rely on the reasoning presented earlier here.

I suppose it's a rather rare case, since it's the first time
that I heard about anyone thinking that a possible second line
could be picked - after 15 years :-)
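
To make the "first cookie wins" reading concrete, here is a small
sketch. The pattern is a PEP 263 style regular expression and the
helper name first_cookie is made up for illustration; this is not a
statement about what CPython itself does for two cookies on one line:

```python
import re

# Assumed PEP 263 style pattern, for illustration only.
COOKIE = re.compile(br'^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)',
                    re.MULTILINE)

def first_cookie(source):
    """Return the first encoding cookie found in the first two
    lines of source (given as bytes), or None if there is none."""
    head = b'\n'.join(source.splitlines()[:2])
    # .search() returns the leftmost match; the non-greedy .*? makes
    # the first cookie on a line win over a later one on that line.
    m = COOKIE.search(head)
    return m.group(1).decode('ascii') if m else None

two_lines = b"# coding: latin-1\n# coding: utf-8\n"
one_line = b"# coding: latin-1 coding: utf-8\n"
third_line = b"# a\n# b\n# coding: utf-8\n"

print(first_cookie(two_lines))
print(first_cookie(one_line))
print(first_cookie(third_line))
```

With this pattern, both the two-line and the one-line case yield
latin-1, and a cookie on the third line is ignored.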

>> I don't like erroring out when there are two different cookies on two
>> lines; I feel that the spirit of the PEP is to read up to two lines
>> until a cookie is found, whichever comes first.
>>
>> I will update the regex in the PEP too (or change the wording to avoid 
>> "match").
>>
>> I'm not sure what to do if there are two cookies on one line. If
>> CPython currently picks the latter we may want to preserve that
>> behavior.
>>
>> Should we recommend that everyone use tokenize.detect_encoding()?
>>
>> --
>> --Guido van Rossum (python.org/~guido)

-- 
Marc-Andre Lemburg
eGenix.com


#!/usr/bin/python
"""
    Utility to detect the source code encoding of a Python file.

    Marc-Andre Lemburg, 2016.

    Supports Python 2.7 and 3.

"""
import sys
import re
import codecs

# Debug output ?
_debug = True

# PEP 263 RE (allows a form feed before the '#' and '_' in encoding names)
PEP263 = re.compile(b'^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)',
                    re.MULTILINE)

###

def detect_source_encoding(code, buffer_size=400):

    """ Detect and return the source code encoding of the Python code
        given in code.
        
        code must be given as bytes.
        
        The function only inspects the first buffer_size bytes of code
        (400 by default) to determine the first two code lines.  This
        can be adjusted using the buffer_size parameter.
        
    """
    # Get the first two lines
    first_two_lines = b'\n'.join(code[:buffer_size].splitlines()[:2])
    # A UTF-8 BOM overrides any source code encoding comment
    if first_two_lines.startswith(codecs.BOM_UTF8):
        return 'utf-8'
    # .search() picks the first occurrence
    m = PEP263.search(first_two_lines)
    if m is None:
        return 'ascii'
    return m.group(1).decode('ascii')

# Tests

def _test():

    l = (
  (b"""\
# No encoding
""", 'ascii'),
  (b"""\
# coding: latin-1
""", 'latin-1'),
  (b"""\
#!/usr/bin/python
# coding: utf-8
""", 'utf-8'),
  (b"""\
coding=123
# The above could be detected as source code encoding
""", 'ascii'),
  (b"""\
# coding: latin-1
# coding: utf-8
""", 'latin-1'),
  (b"""\
# No encoding on first line
# No encoding on second line
# coding: utf-8
""", 'ascii'),
  (codecs.BOM_UTF8 + b"""\
# No encoding
""", 'utf-8'),
  (codecs.BOM_UTF8 + b"""\
# BOM and encoding
# coding: latin-1
""", 'utf-8'),
    )
    for code, encoding in l:
        if _debug:
            print ('=' * 72)
            print ('Checking:')
            print ('-' * 72)
            print (code.decode('latin-1'))
            print ('-' * 72)
        detected_encoding = detect_source_encoding(code)
        if _debug:
            print ('detected: %s, expected: %s' % 
                   (detected_encoding, encoding))
        assert detected_encoding == encoding

if __name__ == '__main__':
    _test()