Bugs item #1178484, was opened at 2005-04-07 14:33
Message generated for change (Comment added) made by lemburg
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1178484&group_id=5470

Category: Parser/Compiler
Group: Python 2.4
Status: Open
Resolution: Accepted
Priority: 5
Submitted By: Timo Linna (tilinna)
Assigned to: Martin v. Löwis (loewis)
Summary: Erroneous line number error in Py2.4.1

Initial Comment:
For some reason Python 2.3.5 reports the error in the 
following program correctly: 

  File "C:\Temp\problem.py", line 7 
SyntaxError: unknown decode error 

..whereas Python 2.4.1 reports an invalid line number: 

  File "C:\Temp\problem.py", line 2 
SyntaxError: unknown decode error 

----- problem.py starts ----- 
# -*- coding: ascii -*- 

""" 
Foo bar 
""" 

# Ä is not allowed in ascii coding 
----- problem.py ends -----

Without the encoding declaration both Python versions 
report the usual deprecation warning (just like they 
should be doing). 

My environment: Windows 2000 + SP3. 
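
The expected line number can be checked with a minimal sketch in
present-day Python 3 (this is not the 2.3/2.4 tokenizer). It assumes
problem.py was saved in a Latin-1-like code page, so that "Ä" is the
single byte 0xC4; first_undecodable_line is a hypothetical helper name.

```python
# Bytes of problem.py as listed above. Assumption: "Ä" is stored as 0xC4.
SOURCE = (
    b"# -*- coding: ascii -*-\n"
    b"\n"
    b'"""\n'
    b"Foo bar\n"
    b'"""\n'
    b"\n"
    b"# \xc4 is not allowed in ascii coding\n"
)


def first_undecodable_line(data, encoding):
    """Return the 1-based number of the first line that fails to decode."""
    for lineno, raw in enumerate(data.splitlines(keepends=True), start=1):
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            return lineno
    return None


print(first_undecodable_line(SOURCE, "ascii"))  # 7
```

Decoding each physical line on its own ties the error to the line that
actually contains the offending byte.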


----------------------------------------------------------------------

>Comment By: M.-A. Lemburg (lemburg)
Date: 2005-05-18 11:31

Message:
Logged In: YES 
user_id=38388

Walter, as I've said before: I know that you need buffering
for the UTF-x readline support, but I don't see a
requirement for it in most of the other codecs. E.g. an
ascii codec or latin-1 codec will only ever see standard
line ends (not Unicode ones), so the stream's .readline()
method can be used directly - just like we did before the
buffering code was added.
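
The difference between standard and Unicode line ends can be shown with
a small sketch in present-day Python 3 (not the 2.4 codec code; the
sample string is made up). Splitting raw bytes only honours \n, \r and
\r\n, while splitting decoded text also breaks on separators such as
U+2028.

```python
# Made-up sample: a Unicode LINE SEPARATOR (U+2028) inside the first line.
text = "first\u2028second\nthird\n"
raw = text.encode("utf-8")

byte_lines = raw.splitlines()    # bytes: only \n, \r and \r\n end a line
text_lines = text.splitlines()   # str: U+2028 also ends a line

print(len(byte_lines), len(text_lines))  # 2 3
```

A byte-level readline() would therefore see two lines here, whereas a
reader working on the decoded text would see three.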

Your argument about applications making assumptions about the
file position after having used .readline() is true, but
many applications still rely on this behavior, which is not
as far-fetched as it may seem given that they normally only
expect 8-bit data.
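
The file-position point can be sketched with today's io and codecs
modules (an illustration with invented sample data, not the 2.4 code
under discussion). A plain byte-stream readline() leaves the position
right after the line, while a codecs.StreamReader reads ahead in the
underlying stream.

```python
import codecs
import io

DATA = b"line one\nline two\nline three\n"  # invented sample data

# Unbuffered: read one line of bytes, then decode it.
plain = io.BytesIO(DATA)
first = plain.readline().decode("ascii")
plain_pos = plain.tell()        # just past the first line

# Buffered: the StreamReader pulls a larger block from the stream.
raw = io.BytesIO(DATA)
reader = codecs.getreader("ascii")(raw)
first_buffered = reader.readline()
buffered_pos = raw.tell()       # beyond the end of the first line

print(first == first_buffered, plain_pos, buffered_pos)
```

Both calls return the same text; only the position of the underlying
stream differs, which is what position-sensitive applications notice.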

Wouldn't it make things a lot safer if we only used buffering
by default in the UTF-x codecs and reverted to the old
non-buffered behavior for the other codecs, which has worked
well in the past?!

About your patch:

* Please explain what firstline is supposed to do
(preferably in the doc-string).
* Why is firstline always set in .readline() ?
* Please remove the print repr()
* You cannot always be sure that exc has a .start attribute,
so you need to accommodate this situation as well
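
One way to guard against a missing .start attribute is sketched below
in present-day Python 3. This is an illustration only, not part of the
patch; error_offset is a hypothetical helper name.

```python
def error_offset(exc):
    """Return the byte offset of a decoding error, or None if unknown."""
    start = getattr(exc, "start", None)
    return start if isinstance(start, int) else None


try:
    b"ab\xc4".decode("ascii")
except UnicodeError as exc:
    # The built-in codecs raise UnicodeDecodeError, which carries .start.
    builtin_offset = error_offset(exc)

# A codec may raise a bare UnicodeError (or ValueError) without .start.
custom_offset = error_offset(UnicodeError("decoding failed"))

print(builtin_offset, custom_offset)  # 2 None
```

A caller that gets None back has no position to split the data at and
would have to re-raise the original exception.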


----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2005-05-17 18:50

Message:
Logged In: YES 
user_id=89016

It isn't the buffering support per se that breaks the
tokenizer. This problem exists even in Python 2.3.x (Simply
try the test scripts from http://www.python.org/sf/1089395
with Python 2.3.5 and you'll get a segfault). Applications
that rely on len(readline(x)) == x or anything similar are
broken anyway. Supporting buffered and unbuffered reading
would mean keeping the 2.3 mode of doing things around
indefinitely, and we'd lose readline() support for UTF-16
again.

BTW, applying Greg Chapman's patch
(http://www.python.org/sf/1101726, which fixes the
tokenizer) together with this one seems to fix the problem
from my previous post. So if you could give
http://www.python.org/sf/1101726 a third look, so we can get
it into 2.4.2, this would be great.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2005-05-17 11:13

Message:
Logged In: YES 
user_id=38388

Walter, I think that instead of trying to get the tokenizer
to work with the buffer support in the codecs, you should
add a flag that allows switching off the buffer support in
the codecs altogether and then use the unbuffered mode
codecs in the tokenizer.

I expect that other applications will run into the same kind
of problem, so it should be possible to switch off buffering
if needed (maybe we should make this the default ?!).

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2005-05-16 10:35

Message:
Logged In: YES 
user_id=89016

OK, here is a patch. It adds an additional argument
firstline to read(). If this argument is true (i.e. if
called from readline()) and a decoding error happens, this
error will only be reported if it is in the first line.
Otherwise read() will decode up to the error position and
put the rest in the bytebuffer.
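
The behavior described above can be sketched roughly as follows, in
present-day Python 3. This is not the patch itself: decode_chunk and
its return convention are hypothetical, and "error in the first line"
is approximated as "no newline before the error position".

```python
def decode_chunk(data, encoding, firstline=False):
    """Decode data; return (text, leftover_bytes).

    With firstline=True, a decoding error past the first line is not
    raised: everything before the error is decoded and the rest is
    handed back so it can be kept in a byte buffer.
    """
    try:
        return data.decode(encoding), b""
    except UnicodeDecodeError as exc:
        if not firstline:
            raise
        good = data[:exc.start].decode(encoding)
        if "\n" not in good:
            # The error sits in the first line, so report it now.
            raise
        return good, data[exc.start:]


text, rest = decode_chunk(b"ok\nb\xc4d\n", "ascii", firstline=True)
print(repr(text), rest)
```

The leftover bytes still contain the bad byte, so the error is raised
on a later call, once the reader has reached the line it belongs to.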

Unfortunately with this patch, I get a segfault with the
following stacktrace if I run the test. I don't know if this
is related to bug #1089395/patch #1101726. Martin, can you
take a look?

#0  0x08057ad1 in tok_nextc (tok=0x81ca7b0) at tokenizer.c:719
#1  0x08058558 in tok_get (tok=0x81ca7b0, p_start=0xbffff3d4, p_end=0xbffff3d0)
    at tokenizer.c:1075
#2  0x08059331 in PyTokenizer_Get (tok=0x81ca7b0, p_start=0xbffff3d4, p_end=0xbffff3d0)
    at tokenizer.c:1466
#3  0x080561b1 in parsetok (tok=0x81ca7b0, g=0x8167980, start=257, err_ret=0xbffff440, flags=0)
    at parsetok.c:125
#4  0x0805613c in PyParser_ParseFileFlags (fp=0x816bdb8, filename=0xbffff7b7 "./bug.py", g=0x8167980, start=257, ps1=0x0, ps2=0x0, err_ret=0xbffff440, flags=0)
    at parsetok.c:90
#5  0x080f3926 in PyParser_SimpleParseFileFlags (fp=0x816bdb8, filename=0xbffff7b7 "./bug.py", start=257, flags=0)
    at pythonrun.c:1345
#6  0x080f352b in PyRun_FileExFlags (fp=0x816bdb8, filename=0xbffff7b7 "./bug.py", start=257, globals=0xb7d62e94, locals=0xb7d62e94, closeit=1, flags=0xbffff544)
    at pythonrun.c:1239
#7  0x080f22f2 in PyRun_SimpleFileExFlags (fp=0x816bdb8, filename=0xbffff7b7 "./bug.py", closeit=1, flags=0xbffff544)
    at pythonrun.c:860
#8  0x080f1b16 in PyRun_AnyFileExFlags (fp=0x816bdb8, filename=0xbffff7b7 "./bug.py", closeit=1, flags=0xbffff544)
    at pythonrun.c:664
#9  0x08055e45 in Py_Main (argc=2, argv=0xbffff5f4) at main.c:484
#10 0x08055366 in main (argc=2, argv=0xbffff5f4) at python.c:23

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2005-04-07 16:28

Message:
Logged In: YES 
user_id=89016

The reason for this is the new codec buffering code in 2.4:
The codec might read and decode more data from the byte
stream than is necessary for decoding one line. I.e. when
reading line n, the codec might decode bytes that belong to
line n+1, n+2 etc. too. If there's a decoding error in this
data, line n gets reported. I don't think there's a simple
fix for this.
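
The effect of decoding ahead can be reproduced with a toy model in
present-day Python 3. It is a sketch only, not the 2.4 codec code:
reported_error_line and the chunk sizes are made up, the sample assumes
"Ä" is stored as the byte 0xC4, and the chunking is only safe for
single-byte encodings such as ascii.

```python
import io

# Bytes of problem.py from the initial comment (assumption: "Ä" is 0xC4).
SOURCE = (
    b"# -*- coding: ascii -*-\n"
    b"\n"
    b'"""\n'
    b"Foo bar\n"
    b'"""\n'
    b"\n"
    b"# \xc4 is not allowed in ascii coding\n"
)


def reported_error_line(data, encoding, chunk_size):
    """Return the line a chunk-decoding reader would blame, or None."""
    stream = io.BytesIO(data)
    lines_delivered = 0
    pending = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return None
        try:
            pending += chunk.decode(encoding)
        except UnicodeDecodeError:
            # The failure surfaces while fetching the next undelivered
            # line, even if the bad byte belongs to a later line.
            return lines_delivered + 1
        *complete, pending = pending.split("\n")
        lines_delivered += len(complete)


print(reported_error_line(SOURCE, "ascii", 1))     # 7: no read-ahead
print(reported_error_line(SOURCE, "ascii", 32))    # 4
print(reported_error_line(SOURCE, "ascii", 1024))  # 1
```

The larger the chunk decoded in one go, the further the reported line
drifts from the line that actually holds the bad byte.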

----------------------------------------------------------------------
