On Thu, Mar 17, 2016 at 9:50 AM, Serhiy Storchaka <storch...@gmail.com> wrote:
> On 17.03.16 16:55, Guido van Rossum wrote:
>> On Thu, Mar 17, 2016 at 5:04 AM, Serhiy Storchaka <storch...@gmail.com>
>> wrote:
>>>> Should we recommend that everyone use tokenize.detect_encoding()?
>>>
>>> Likely. However, the interface of tokenize.detect_encoding() is not
>>> very simple.
>>
>> I just found that out yesterday. You have to give it a readline()
>> function, which is cumbersome if all you have is a (byte) string and
>> you don't want to split it on lines just yet. And the readline()
>> function raises SyntaxError when the encoding isn't right. I wish
>> there were a lower-level helper that just took a line and told you
>> what the encoding in it was, if any. Then the rest of the logic can be
>> handled by the caller (including the logic of trying up to two lines).
>
> The simplest way to detect the encoding of a bytes string:
>
>     lines = data.splitlines()
>     encoding = tokenize.detect_encoding(iter(lines).__next__)[0]
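A minimal, self-contained sketch of that splitlines approach, with the SyntaxError case discussed below handled by a fallback (the helper name guess_encoding and the UTF-8 fallback policy are assumptions for illustration, not anything tokenize itself provides):

    import tokenize

    def guess_encoding(data):
        # data: the raw bytes of a Python source file.
        # Returns the declared (or default) source encoding.
        lines = data.splitlines()
        try:
            return tokenize.detect_encoding(iter(lines).__next__)[0]
        except SyntaxError:
            # Raised e.g. for an unknown encoding name or a coding
            # cookie that conflicts with a BOM; fall back to UTF-8.
            return 'utf-8'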
This will raise SyntaxError if the encoding is unknown. That needs to be
caught in mypy's case, and then it needs to get the line number from the
exception. I tried this and it was too painful, so now I've just changed
the regex that mypy uses to use non-greedy matching
(https://github.com/python/mypy/commit/b291998a46d580df412ed28af1ba1658446b9fe5).

> If you don't want to split all the data on lines, the most efficient
> way in Python 3.5 is:
>
>     encoding = tokenize.detect_encoding(io.BytesIO(data).readline)[0]
>
> In Python 3.5, io.BytesIO(data) has constant complexity.

Ditto with the SyntaxError though.

> In older versions, to detect the encoding without copying the data or
> splitting it all on lines, you should write a line iterator. For
> example:
>
>     def iterlines(data):
>         start = 0
>         while True:
>             end = data.find(b'\n', start) + 1
>             if not end:
>                 break
>             yield data[start:end]
>             start = end
>         yield data[start:]
>
>     encoding = tokenize.detect_encoding(iterlines(data).__next__)[0]
>
> or
>
>     it = (m.group() for m in re.finditer(b'.*\n?', data))
>     encoding = tokenize.detect_encoding(it.__next__)[0]
>
> I don't know which approach is more efficient.

Having my own regex was simpler. :-(

-- 
--Guido van Rossum (python.org/~guido)
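For reference, the kind of lower-level, single-line helper wished for above might look like the following: a PEP 263-style coding-cookie matcher using a non-greedy '.*?', modeled on the stdlib tokenize module's internal cookie regex (the exact pattern mypy adopted is in the commit linked above; the helper name find_cookie and its never-raising contract are assumptions):

    import re

    # Matches a PEP 263 coding cookie on one line of bytes; the
    # non-greedy '.*?' stops at the first 'coding[:=]' rather than
    # the last. Modeled on tokenize's internal cookie regex.
    CODING_RE = re.compile(rb'^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)')

    def find_cookie(line):
        # Return the encoding name declared on this line, or None.
        # Unlike detect_encoding(), this never raises; validating the
        # name and trying up to two lines is left to the caller.
        m = CODING_RE.match(line)
        return m.group(1).decode('ascii') if m else None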