Karthikeyan Singaravelan <[email protected]> added the comment:
Got it. Thanks for the details and patience. I tested with fewer characters
and it works fine, so, as you mentioned, adding an encoding declaration at the
top is not a good way to exercise the original issue. Searching around, I
found issue14811, which looks like a very similar issue and includes a test.
It has a patch that detects this scenario and raises a SyntaxError saying the
line is longer than the internal buffer, instead of a misleading
encoding-related error. I applied that patch to master and it raises the
expected error about the internal buffer length. However, the patch was never
merged; it seems Victor had another solution in mind, per msg167154. My test
with the patch is below:
# master
➜ cpython git:(master) cat ../backups/bpo34979.py
s = '测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试'
print("str len : ", len(s))
print("bytes len : ", len(s.encode('utf-8')))
➜ cpython git:(master) ./python.exe ../backups/bpo34979.py
File "../backups/bpo34979.py", line 2
SyntaxError: Non-UTF-8 code starting with '\xe8' in file ../backups/bpo34979.py on line 2, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
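The misleading encoding error presumably happens because the raw read stops
partway through a multibyte UTF-8 sequence, so the next chunk begins
mid-character. A quick illustration of the same effect (my own snippet, not
from the patch; 试 encodes to three bytes starting with '\xe8'):

>>> "试".encode("utf-8")
b'\xe8\xaf\x95'
>>> b'\xe8\xaf'.decode("utf-8")  # a buffer boundary splitting the sequence
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: unexpected end of data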
# Applying the patch file from issue14811
➜ cpython git:(master) ✗ ./python.exe ../backups/bpo34979.py
File "../backups/bpo34979.py", line 2
SyntaxError: Line 2 of file ../backups/bpo34979.py is longer than the internal buffer (1024)
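For reference, a reproducer like mine can be generated with a small script
along these lines (my own sketch; the repetition count is just an assumption,
anything whose UTF-8 encoding exceeds the 1024-byte buffer should do):

# make_repro.py (hypothetical helper, not part of CPython)
# Writes a single source line whose UTF-8 encoding is about 1200 bytes,
# comfortably over the tokenizer's internal buffer.
text = "测试" * 200  # 400 chars x 3 bytes each = 1200 bytes in UTF-8
with open("bpo34979.py", "w", encoding="utf-8") as f:
    f.write("s = '%s'\n" % text)
    f.write('print("str len : ", len(s))\n')
    f.write('print("bytes len : ", len(s.encode(\'utf-8\')))\n')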
# Patch on master
diff --git a/Parser/tokenizer.c b/Parser/tokenizer.c
index fc75bae537..48b3ac0ee9 100644
--- a/Parser/tokenizer.c
+++ b/Parser/tokenizer.c
@@ -586,6 +586,7 @@ static char *
 decoding_fgets(char *s, int size, struct tok_state *tok)
 {
     char *line = NULL;
+    size_t len;
     int badchar = 0;
     for (;;) {
         if (tok->decoding_state == STATE_NORMAL) {
@@ -597,6 +598,15 @@ decoding_fgets(char *s, int size, struct tok_state *tok)
             /* We want a 'raw' read. */
             line = Py_UniversalNewlineFgets(s, size,
                                             tok->fp, NULL);
+            if (line != NULL) {
+                len = strlen(line);
+                if (1 < len && line[len-1] != '\n') {
+                    PyErr_Format(PyExc_SyntaxError,
+                        "Line %i of file %U is longer than the internal buffer (%i)",
+                        tok->lineno + 1, tok->filename, size);
+                    return error_ret(tok);
+                }
+            }
             break;
         } else {
             /* We have not yet determined the encoding.
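The check works because Py_UniversalNewlineFgets() is fgets()-like: it stops
after at most size - 1 bytes, so a read that fills the buffer without ending
in '\n' means the physical line was truncated. A rough Python analogue of the
idea (illustration only, not CPython code; stdio's BUFSIZ is
platform-dependent and 1024 is just what I see here):

# rough_check.py (my own sketch)
BUFSIZ = 1024  # assumption; matches the buffer size in the error above

def check_line_lengths(path, size=BUFSIZ):
    with open(path, "rb") as f:
        lineno = 0
        while True:
            chunk = f.readline(size - 1)  # emulate fgets(s, size, fp)
            if not chunk:
                break
            lineno += 1
            # A chunk that fills the buffer without a trailing newline
            # was cut off mid-line.
            if len(chunk) == size - 1 and not chunk.endswith(b"\n"):
                raise SyntaxError(
                    "Line %d of file %s is longer than the internal "
                    "buffer (%d)" % (lineno, path, size))

Note the patch keys off line[len-1] != '\n' alone, so it would also fire on a
short final line with no trailing newline; the len == size - 1 guard above
avoids that, at the cost of not matching the patch exactly.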
If this is the same issue, I think it would be good to close this one and
continue the discussion on issue14811, since that issue already has a patch
with a test and the relevant discussion. Also, since BUFSIZ is
platform-dependent, adding your platform details there would help as well.
TIL about the difference between Python 2 and 3 in how they handle Unicode in
source files. Thanks again!
----------
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue34979>
_______________________________________