Karthikeyan Singaravelan <[email protected]> added the comment:
Got it. Thanks for the details and patience. I tested with fewer characters
and it works fine, so, as you mentioned, adding an encoding declaration at the
top is not a good way to exercise the original issue. Searching around, I
found issue14811, which looks like a very similar issue and includes a test.
It has a patch that detects this scenario and raises a SyntaxError saying the
line is longer than the internal buffer, instead of a misleading
encoding-related error. I applied that patch to master and it raises the
expected error about the internal buffer length. However, the patch was never
merged; it seems Victor had another solution in mind, per msg167154. My test
with the patch is below:
# master
➜ cpython git:(master) cat ../backups/bpo34979.py
s = '测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试'
print("str len : ", len(s))
print("bytes len : ", len(s.encode('utf-8')))
➜ cpython git:(master) ./python.exe ../backups/bpo34979.py
File "../backups/bpo34979.py", line 2
SyntaxError: Non-UTF-8 code starting with '\xe8' in file ../backups/bpo34979.py on line 2, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
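The misleading encoding error presumably happens because the raw read stops
partway through a multibyte UTF-8 sequence, so the next chunk begins
mid-character. A quick illustration of the same effect (my own snippet, not
from the patch; 试 encodes to three bytes starting with '\xe8'):

>>> "试".encode("utf-8")
b'\xe8\xaf\x95'
>>> b'\xe8\xaf'.decode("utf-8")  # a buffer boundary splitting the sequence
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: unexpected end of data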
# Applying the patch file from issue14811
➜ cpython git:(master) ✗ ./python.exe ../backups/bpo34979.py
File "../backups/bpo34979.py", line 2
SyntaxError: Line 2 of file ../backups/bpo34979.py is longer than the internal buffer (1024)
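For reference, a reproducer like mine can be generated with a small script
along these lines (my own sketch; the repetition count is just an assumption,
anything whose UTF-8 encoding exceeds the 1024-byte buffer should do):

# make_repro.py (hypothetical helper, not part of CPython)
# Writes a single source line whose UTF-8 encoding is about 1200 bytes,
# comfortably over the tokenizer's internal buffer.
text = "测试" * 200  # 400 chars x 3 bytes each = 1200 bytes in UTF-8
with open("bpo34979.py", "w", encoding="utf-8") as f:
    f.write("s = '%s'\n" % text)
    f.write('print("str len : ", len(s))\n')
    f.write('print("bytes len : ", len(s.encode(\'utf-8\')))\n')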
# Patch on master
diff --git a/Parser/tokenizer.c b/Parser/tokenizer.c
index fc75bae537..48b3ac0ee9 100644
--- a/Parser/tokenizer.c
+++ b/Parser/tokenizer.c
@@ -586,6 +586,7 @@ static char *
 decoding_fgets(char *s, int size, struct tok_state *tok)
 {
     char *line = NULL;
+    size_t len;
     int badchar = 0;
     for (;;) {
         if (tok->decoding_state == STATE_NORMAL) {
@@ -597,6 +598,15 @@ decoding_fgets(char *s, int size, struct tok_state *tok)
             /* We want a 'raw' read. */
             line = Py_UniversalNewlineFgets(s, size,
                                             tok->fp, NULL);
+            if (line != NULL) {
+                len = strlen(line);
+                if (1 < len && line[len-1] != '\n') {
+                    PyErr_Format(PyExc_SyntaxError,
+                        "Line %i of file %U is longer than the internal buffer (%i)",
+                        tok->lineno + 1, tok->filename, size);
+                    return error_ret(tok);
+                }
+            }
             break;
         } else {
             /* We have not yet determined the encoding.
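The check works because Py_UniversalNewlineFgets() is fgets()-like: it stops
after at most size - 1 bytes, so a read that fills the buffer without ending
in '\n' means the physical line was truncated. A rough Python analogue of the
idea (illustration only, not CPython code; stdio's BUFSIZ is
platform-dependent and 1024 is just what I see here):

# rough_check.py (my own sketch)
BUFSIZ = 1024  # assumption; matches the buffer size in the error above

def check_line_lengths(path, size=BUFSIZ):
    with open(path, "rb") as f:
        lineno = 0
        while True:
            chunk = f.readline(size - 1)  # emulate fgets(s, size, fp)
            if not chunk:
                break
            lineno += 1
            # A chunk that fills the buffer without a trailing newline
            # was cut off mid-line.
            if len(chunk) == size - 1 and not chunk.endswith(b"\n"):
                raise SyntaxError(
                    "Line %d of file %s is longer than the internal "
                    "buffer (%d)" % (lineno, path, size))

Note the patch keys off line[len-1] != '\n' alone, so it would also fire on a
short final line with no trailing newline; the len == size - 1 guard above
avoids that, at the cost of not matching the patch exactly.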
If this is the same issue, I think it would be good to close this one and
continue the discussion on issue14811, since that issue already has a patch
with a test and the relevant discussion. Also, since BUFSIZ is
platform-dependent, adding your platform details there would help as well.
TIL about the difference between Python 2 and 3 in how they handle Unicode in
source files. Thanks again!
----------
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue34979>
_______________________________________