hi,

On Monday 29 March 2010 18:38:12 Alexander Artemenko wrote:
> >> On 2010/03/29 16:05 - svetlyak40wt wrote:
> >> I've solved this annoying problem. Here is the patch:
> >> http://gist.github.com/347854
> >
> > On 2010/03/29 16:57 - sthenault wrote:
> > would you please add a test case to the functional suite?
> >
> > see test/input and test/messages or search ml archives for more
> > details
>
> Hi Sylvain, I've updated the patch and added tests.
>
>
> by: Alexander Artemenko
> url: http://www.logilab.org/ticket/4683

I tried out your patch, but unfortunately it generated a
UnicodeDecodeError in our test suite.

I fixed it, without understanding why, by splitting your lambda
declaration into two lines:

+                decode = stream.readline().decode
+                line_generator = lambda: decode(encoding)
       
instead of:

+                line_generator = lambda: stream.readline().decode(encoding)


Can somebody explain to me what happened?
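
To make it easier to reproduce, here is a minimal standalone sketch
(hypothetical stream contents, nothing from the pylint code paths)
contrasting the two forms; note that the split version calls readline()
once, at binding time, whereas the one-line lambda calls it anew on
every call:

# -*- coding: utf-8 -*-
from StringIO import StringIO

stream = StringIO("a = 1\nb = 2\n")
encoding = 'utf-8'

# one-line form: readline() runs each time the lambda is called
gen1 = lambda: stream.readline().decode(encoding)
print repr(gen1()), repr(gen1())    # two different lines

stream.seek(0)

# split form: readline() runs once, right here; the lambda re-decodes
# that same first line on every call
decode = stream.readline().decode
gen2 = lambda: decode(encoding)
print repr(gen2()), repr(gen2())    # the same line, twice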

Anyhow, I've appended my new patch (we use func_noerror_* when we don't
want the message triggered).

Is that OK?
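
For context, the intent of process_module (in the patch below) is to
hand tokenize.generate_tokens a callable that decodes each raw line, so
that the checker counts characters rather than bytes. A rough sketch of
that idea, on a hypothetical module text:

# -*- coding: utf-8 -*-
import tokenize
from StringIO import StringIO

stream = StringIO('x = "déjà vu"\n')
encoding = 'utf-8'

# each call reads the next raw line and decodes it before the
# tokenizer sees it
line_generator = lambda: stream.readline().decode(encoding)

for tok_info in tokenize.generate_tokens(line_generator):
    tok_type, string, start, end, line = tok_info
    print tokenize.tok_name[tok_type], repr(string)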

-- 

Emile Anclin <emile.anc...@logilab.fr>
http://www.logilab.fr/   http://www.logilab.org/ 
Scientific computing & knowledge management

fix #4683: non-ASCII characters count double in utf-8 source files

diff -r cdd571901fea checkers/format.py
--- a/checkers/format.py	Mon Mar 29 11:27:19 2010 +0200
+++ b/checkers/format.py	Tue Mar 30 11:13:09 2010 +0200
@@ -31,6 +31,7 @@
 
 from pylint.interfaces import IRawChecker, IASTNGChecker
 from pylint.checkers import BaseRawChecker
+from pylint.checkers.misc import guess_encoding, is_ascii
 
 MSGS = {
     'C0301': ('Line too long (%s/%s)',
@@ -178,6 +179,25 @@
         self._lines = None
         self._visited_lines = None
 
+    def process_module(self, stream):
+        """extract the encoding from the stream and
+        decode each line, so that the length of
+        international text is calculated properly.
+        """
+        data = stream.read()
+        line_generator = stream.readline
+
+        ascii, lineno = is_ascii(data)
+        if not ascii:
+            encoding = guess_encoding(data)
+            if encoding is not None:
+                decode = stream.readline().decode
+                line_generator = lambda: decode(encoding)
+        del data
+
+        stream.seek(0)
+        self.process_tokens(tokenize.generate_tokens(line_generator))
+
     def new_line(self, tok_type, line, line_num, junk):
         """a new line has been encountered, process it if necessary"""
         if not tok_type in junk:
diff -r cdd571901fea test/input/func_noerror_long_utf8_line.py
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/test/input/func_noerror_long_utf8_line.py	Tue Mar 30 11:13:09 2010 +0200
@@ -0,0 +1,8 @@
+# -*- coding: utf-8 -*-
+"""this utf-8 doc string have some     non ASCII caracters like 'é', or '¢»ß'"""
+### check also comments with some     more non ASCII caracters like 'é' or '¢»ß'
+
+__revision__ = 1100
+print "------------------------------------------------------------------------"
+print "-----------------------------------------------------------------------é"
+
_______________________________________________
Python-Projects mailing list
Python-Projects@lists.logilab.org
http://lists.logilab.org/mailman/listinfo/python-projects
