[issue1170] shlex have problems with parsing unicode

Andrew Jewett Thu, 15 Sep 2011 04:25:22 -0700

Andrew Jewett <[email protected]> added the comment:

Proposed solution and patch to follow.  Please let me know if I am posting it 
in the wrong place.


The main problem with shlex is that its interface is inadequate for handling 
unicode.  Specifically, it is no longer feasible to provide a list of every 
possible character that the user could want to appear within a token.  Suppose 
the user wants to parse words in simplified Chinese.  If I understand 
correctly, they would currently have to set "self.wordchars" to a string (or 
some other container) of 6000 (unicode) characters, and this enormous string 
would need to be searched each time a new character is read.  
This was a problem with shlex from the beginning, but it became more acute when 
support for unicode was added.  In cases like this, it is much more convenient 
to specify a short list of the characters you -don't- want to appear in a word 
(word delimiters) than to list all the characters you do.
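
A minimal sketch of the limitation described above, assuming Python 3 and the 
standard shlex module in its default (non-POSIX) mode.  The sample string is 
made up for illustration and does not come from the original report.

```python
import shlex

text = "中文 abc"  # hypothetical sample input

# Default lexer: wordchars holds only ASCII letters, digits and "_", so the
# two Chinese characters are not accumulated into a single word.
default_tokens = list(shlex.shlex(text))

# Workaround with the existing interface: every extra character that may
# appear in a word has to be listed explicitly in wordchars.
lexer = shlex.shlex(text)
lexer.wordchars += "中文"
extended_tokens = list(lexer)

print(default_tokens)
print(extended_tokens)
```

Listing two characters is easy; listing every character of a large script is 
the part that does not scale.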

An obvious (although perhaps not optimal) solution is to add an additional data 
member to shlex, consisting of the characters which terminate the reading of a 
token.  (In other words, the set complement of wordchars.)  In the attached 
example code, I call it "self.wordterminators".  To remain backwards-compatible 
with shlex, self.wordterminators is empty by default.  But if it is non-empty, 
self.wordterminators overrides self.wordchars.
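
The override rule can be sketched with a toy tokenizer.  This is not the 
attached shlex_wt.py and not part of shlex: the function split_words and its 
parameters are hypothetical, it ignores quoting, escapes and comments, and it 
assumes that whitespace always ends a word.

```python
def split_words(text, wordchars="", wordterminators="", whitespace=" \t\r\n"):
    """Toy tokenizer illustrating wordterminators overriding wordchars."""

    def is_word_char(ch):
        if wordterminators:
            # Non-empty terminator list: anything that is neither a
            # terminator nor whitespace counts as part of a word.
            return ch not in wordterminators and ch not in whitespace
        # Empty terminator list: fall back to the explicit wordchars list.
        return ch in wordchars

    tokens = []
    current = ""
    for ch in text:
        if is_word_char(ch):
            current += ch
            continue
        if current:
            tokens.append(current)
            current = ""
        if ch not in whitespace:
            # A non-word, non-whitespace character becomes its own token.
            tokens.append(ch)
    if current:
        tokens.append(current)
    return tokens


# Only the two delimiters are listed; no Chinese character is enumerated.
by_terminators = split_words("中文=值; abc", wordterminators="=;")

# With an empty terminator list, only characters in wordchars form words.
by_wordchars = split_words("中文 abc", wordchars="abc")
```

With the terminator list, the caller names two characters instead of every 
character that might occur inside a word.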

I've been distributing a customized version of shlex (shlex_wt) with my own 
software; it implements this modest change.  (See attachment.)  It is otherwise 
identical to the version of shlex.py that ships with Python 3.2.2.  (It has 
been further modified only slightly to be compatible with both Python 2.7 and 
Python 3.)  It's not beautiful code, but it seems to be a successful kluge for 
this particular issue.  I don't know if it makes a worthy patch, but perhaps 
somebody out there will find it useful.  To make it easy to spot the changes, 
each of the lines I changed ends in a comment "#WORDTERMINATORS".  (There are 
only 15 of these lines.)
-Andrew Jewett

----------
nosy: +wombat
versions:  -Python 2.7
Added file: http://bugs.python.org/file23161/shlex_wt.py

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue1170>
_______________________________________
