RE: String handling broken?

Fuzzier Wed, 27 Jun 2007 18:46:12 -0700

Thanks a lot!
The problem is solved... though only partially.

Now I know that the possible solutions are:
1. Explicitly convert all string variables to unicode by using the unicode() 
function. But I suspect that this solution is a little slow and makes code 
harder to write.
2. Implicitly force functions such as os.listdir() to return unicode strings by 
passing unicode arguments. That seems a little ridiculous, though, and I wonder: 
for functions that take more than one argument, what will happen if I pass one 
as unicode and another as ANSI?
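
For reference, the two options above can be sketched as follows. This is only an illustration, not code from the thread: the byte string is the file name from the output further down, and the b'...' literals assume Python 2.6 or later (they do not exist in 2.5).

```python
import os

# Option 1: decode a byte string explicitly, naming its encoding.
raw = b'\x83p\x83C\x83\\\x83\x93.txt'      # Shift-JIS bytes of the file name
name = raw.decode('shift_jis')             # now a unicode string

# Option 2: pass a unicode argument so os.listdir() returns unicode names.
unicode_names = os.listdir(u'.')
all_text = all(isinstance(n, type(u'')) for n in unicode_names)

print(repr(name))
print(all_text)
```

Once a name is unicode, the 0x5C byte inside the third character is no longer visible as a separate backslash.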


But why is the problem only partially solved?
Suppose I don't use unicode at all: the script is saved as an ANSI text file, 
and the encoding is specified explicitly (in this case, 'shift-jis').
I made another script that prints the internal representations of the strings; 
let's see what I found.
################ Script ############################################
# -*- encoding: shift_jis -*-
# script5.py (saved as ANSI text file)

import os, re

def rename():
        pattern = 'パイソン\.txt'   # ANSI
        print 'pattern: ', repr(pattern)

        myre = re.compile(pattern)
        for f in os.listdir('.'):
                m = myre.match(f)
                if m != None: print repr(f), ': match!'
                else: print repr(f), ': doesn\'t match!'

rename()
################# Output ###########################################
pattern:  '\x83p\x83C\x83\\\x83\x93\\.txt'
'\x83p\x83C\x83\\\x83\x93.txt' : doesn't match!

As we can see, there is a '\\' inside the internal representation of both the 
pattern string and the file name.
I think this is why a match is not possible: the interpreter perceives this 
'\\' as the start of an escape sequence, rather than as what it really is, 
the second byte of an MBCS character.
It's a bug, I think?
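
The failure can be reproduced without a Shift-JIS source file by spelling out the bytes. The sketch below is an illustration under assumptions, not part of the original script: the two workarounds (re.escape() on the byte string, or decoding to unicode first) are suggestions, and the b'...' literals need Python 2.6 or later. As far as I can tell, it is the re module, working byte by byte, that consumes the 0x5C byte as an escape prefix for the byte after it.

```python
import re

# Shift-JIS bytes of the file name; the 0x5C byte (a backslash) is the
# second byte of the third character.
name = b'\x83p\x83C\x83\\\x83\x93.txt'

# The pattern exactly as the script sees it: the embedded 0x5C escapes
# the following 0x83 byte, so the regex no longer contains a backslash.
naive = re.compile(b'\x83p\x83C\x83\\\x83\x93\\.txt')

# Workaround A: stay with bytes, but escape the literal part first.
escaped = re.compile(re.escape(b'\x83p\x83C\x83\\\x83\x93') + b'\\.txt')

# Workaround B: decode to unicode and match on characters, not bytes.
uni = re.compile(u'\u30d1\u30a4\u30bd\u30f3\\.txt')

print(naive.match(name))
print(escaped.match(name) is not None)
print(uni.match(name.decode('shift_jis')) is not None)
```

Workaround B is the same idea as solution 2 above: with unicode file names and a unicode pattern, no character contains a stray backslash.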

In the ActiveState Python 2.5 documentation, I found this: 'On systems whose native 
character set is not ASCII, strings may use EBCDIC in their internal 
representation, provided the functions chr() and ord() implement a mapping 
between ASCII and EBCDIC, and string comparison preserves the ASCII order. Or 
perhaps someone can propose a better rule?'.
So string matching is based on the rather messy internal representation, not 
on anything Unicode. I mean that strings which look the same externally (on 
stdin) are in fact perceived very differently internally, and matching is done 
on those internal representations instead of on the external ones, as it 
should be.
BTW, Perl on the other hand handles strings quite well. Also, the Python 
interpreter doesn't recognize unicode text files saved by Notepad. I hope such 
a feature can be added in the future to prevent confusion and to boost 
performance, since we wouldn't have to convert anything to unicode (at least 
on Windows NT systems ;)).
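
On the point about internal versus external representations, a small sketch shows how the same visible text has different bytes under different encodings, while the decoded unicode values compare equal. The choice of UTF-8 as the second encoding is only for illustration, and the b'...' literal in the comment assumes Python 2.6 or later.

```python
# The four characters of the file name stem, written as unicode escapes
# so that the example does not depend on the source file's encoding.
text = u'\u30d1\u30a4\u30bd\u30f3'

sjis = text.encode('shift_jis')   # contains a 0x5C byte: b'\x83p\x83C\x83\\\x83\x93'
utf8 = text.encode('utf-8')

print(repr(sjis))
print(repr(utf8))

# Byte-level comparison differs; character-level comparison does not.
same_bytes = (sjis == utf8)
same_text = (sjis.decode('shift_jis') == utf8.decode('utf-8'))
print(same_bytes)
print(same_text)
```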

Thanks again!

_______________________________________________
ActivePython mailing list
ActivePython@listserv.ActiveState.com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs
Other options: http://listserv.ActiveState.com/mailman/listinfo/ActivePython
