On 11/25/2010 08:30 AM, Emile Anclin wrote:

hello,

working on Pylint, we have a lot of deliberately corrupted files to test
Pylint's behavior; for instance:

$ cat /home/emile/var/pylint/test/input/func_unknown_encoding.py
# -*- coding: IBO-8859-1 -*-
""" check correct unknown encoding declaration
"""

__revision__ = 'éééé'


We then try to find that module with
find_module('func_unknown_encoding', None), but Python 3 raises SyntaxError
in that case. Python 2 did not raise SyntaxError, and Python 3 does not
raise it for our func_nonascii_noencoding and func_wrong_encoding modules
(whose names are self-explanatory).

Python 3.2a2 (r32a2:84522, Sep 14 2010, 15:22:36)
[GCC 4.3.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from imp import find_module
>>> find_module('func_unknown_encoding', None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
SyntaxError: encoding problem: with BOM
>>> find_module('func_wrong_encoding', None)
(<_io.TextIOWrapper name=5 encoding='utf-8'>, 'func_wrong_encoding.py',
('.py', 'U', 1))
>>> find_module('func_nonascii_noencoding', None)
(<_io.TextIOWrapper name=6 encoding='utf-8'>,
'func_nonascii_noencoding.py', ('.py', 'U', 1))


So what is the reason for this selective behavior?
Furthermore, there is BOM in our func_unknown_encoding.py module.

I don't think there is a clear reason by design. Also try importing the same modules directly and noting the differences in the errors you get.

For example, here is the problem that brought this to my attention in Python 3.2:

>>> find_module('test/badsyntax_pep3120')
Segmentation fault

>>> from test import badsyntax_pep3120
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.2/test/badsyntax_pep3120.py", line 1
SyntaxError: Non-UTF-8 code starting with '\xf6' in file /usr/local/lib/python3.2/test/badsyntax_pep3120.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details


The import statement uses parser.c, and through it tokenizer.c, to import a file, but the imp module uses tokenizer.c directly. The two aren't consistent in how they handle errors, because the error messages are generated in different places depending on what the error is, *and* what code path led to that point, *and* whether or not a filename was set. For the example above with imp.find_module(), the filename isn't set, so you get a different error than you do with import, which goes through the parser code, and that does set the filename.

From what I've seen, it would help if the imp module were rewritten to use parser.c like the import statement does, rather than tokenizer.c directly. The error handling in parser.c is much better than that in tokenizer.c. Possibly tokenizer.c could then be cleaned up and made much simpler.
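
As a rough illustration of why only one of the three test modules fails at this stage, here is a minimal sketch using the pure-Python tokenize.detect_encoding() from the standard library. This is an assumption-laden stand-in: it is not the C code path in tokenizer.c that imp uses, and the byte strings and the probe_encoding() helper are hypothetical, only modelled on the Pylint test files quoted above. It shows that encoding detection looks only at the first line or two, so an unknown declaration is rejected immediately, while undeclared or mis-declared non-ASCII bytes further down go unnoticed until the source is actually decoded.

```python
import io
import tokenize

# Hypothetical stand-ins for the three Pylint test modules (not the real files).
UNKNOWN = b"# -*- coding: IBO-8859-1 -*-\n__revision__ = '\xe9\xe9\xe9\xe9'\n"
NO_DECL = b'""" no encoding declared """\n__revision__ = \'\xe9\xe9\xe9\xe9\'\n'
WRONG = b"# -*- coding: utf-8 -*-\n__revision__ = '\xe9\xe9\xe9\xe9'\n"


def probe_encoding(source_bytes):
    """Return the detected encoding name, or the SyntaxError text if detection fails."""
    readline = io.BytesIO(source_bytes).readline
    try:
        # detect_encoding() reads at most the first two lines of the source.
        encoding, _lines = tokenize.detect_encoding(readline)
    except SyntaxError as exc:
        return "SyntaxError: {}".format(exc)
    return encoding


if __name__ == "__main__":
    for name, data in [("unknown", UNKNOWN), ("no_decl", NO_DECL), ("wrong", WRONG)]:
        print(name, "->", probe_encoding(data))
```

In this sketch only the unknown-encoding sample fails during detection; the other two are reported as utf-8 even though their Latin-1 bytes are not valid UTF-8. The wording of the error also depends on whether the reader object carries a file name, which is the same kind of path-dependent messaging described above.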

Ron Adam

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev