On 17.03.16 16:55, Guido van Rossum wrote:
> On Thu, Mar 17, 2016 at 5:04 AM, Serhiy Storchaka <storch...@gmail.com> wrote:
>>> Should we recommend that everyone use tokenize.detect_encoding()?
>>
>> Likely. However the interface of tokenize.detect_encoding() is not very
>> simple.
>
> I just found that out yesterday. You have to give it a readline()
> function, which is cumbersome if all you have is a (byte) string and
> you don't want to split it on lines just yet. And the readline()
> function raises SyntaxError when the encoding isn't right. I wish
> there were a lower-level helper that just took a line and told you
> what the encoding in it was, if any. Then the rest of the logic can be
> handled by the caller (including the logic of trying up to two lines).
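
Such a helper would mostly be the PEP 263 cookie regex. A rough sketch
(encoding_from_line() is just an illustrative name, not an existing
function; a real helper would still need to handle the BOM, the utf-8
default, and checking that the declared codec actually exists):

    import re

    # PEP 263 coding cookie, checked against a single raw source line.
    COOKIE_RE = re.compile(br'^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)')

    def encoding_from_line(line):
        # Return the encoding declared on this line, or None
        # if there is no coding cookie (hypothetical helper).
        match = COOKIE_RE.match(line)
        return match.group(1).decode('ascii') if match else None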

The simplest way to detect the encoding of a bytes string:

    lines = data.splitlines()
    encoding = tokenize.detect_encoding(iter(lines).__next__)[0]
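
For example, with a coding cookie on the second line (a quick
illustration; note that detect_encoding() also returns the lines it
consumed as the second element of the tuple):

    import tokenize

    data = b'#!/usr/bin/env python\n# -*- coding: utf-8 -*-\nprint("hi")\n'
    lines = data.splitlines()
    print(tokenize.detect_encoding(iter(lines).__next__)[0])  # utf-8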

If you don't want to split all of the data into lines up front, the most efficient way in Python 3.5 is:

    encoding = tokenize.detect_encoding(io.BytesIO(data).readline)[0]

In Python 3.5, io.BytesIO(data) has constant complexity: the constructor shares the buffer with the bytes object and only copies it if the stream is later modified (copy-on-write, bpo-22003).
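
A rough way to see this (timings are machine-dependent; on 3.4 and
earlier they grow with the size of the data, on 3.5 they stay roughly
flat):

    import io
    import timeit

    for size in (10**4, 10**6, 10**8):
        data = b'x' * size
        t = timeit.timeit(lambda: io.BytesIO(data), number=100)
        print(size, t)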

In older versions, to detect the encoding without copying the data or splitting it all into lines, you have to write a line iterator yourself. For example:

    def iterlines(data):
        # Lazily yield successive lines of `data`, keeping their newlines.
        start = 0
        while True:
            end = data.find(b'\n', start) + 1
            if not end:
                break
            yield data[start:end]
            start = end
        # Yield the last line (empty if `data` ends with a newline).
        yield data[start:]

    encoding = tokenize.detect_encoding(iterlines(data).__next__)[0]


Or, using regular expressions:

    import re

    it = (m.group() for m in re.finditer(b'.*\n?', data))
    encoding = tokenize.detect_encoding(it.__next__)[0]

I don't know which approach is more efficient.
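
One way to settle it (a quick timeit sketch, assuming the iterlines()
generator above is defined; results will vary with the input and the
machine):

    import re
    import timeit
    import tokenize

    data = b'# -*- coding: utf-8 -*-\n' + b'x = 1\n' * 1000

    def with_iterlines():
        return tokenize.detect_encoding(iterlines(data).__next__)[0]

    def with_finditer():
        it = (m.group() for m in re.finditer(b'.*\n?', data))
        return tokenize.detect_encoding(it.__next__)[0]

    print(timeit.timeit(with_iterlines, number=10000))
    print(timeit.timeit(with_finditer, number=10000))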
