Thanks a lot! The problem is solved... though only partially. I now know that the possible solutions are:

1. Explicitly convert all string variables to unicode with the unicode() function. I suspect this is a little slow, though, and it makes the code harder to write.
2. Implicitly force functions such as os.listdir() to return unicode strings by passing them unicode arguments. That seems a little ridiculous, and it makes me wonder: for functions that take more than one argument, what happens if I pass one as unicode and another as ANSI?
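To make option 2 concrete, here is a minimal sketch. Note the assumptions: it is written for Python 3 rather than the Python 2 used in this thread, so bytes stands in for the old str and str for the old unicode. The helper names list_both_ways and mixing_is_rejected and the file name sample.txt are made up for illustration. It suggests that the result type of os.listdir() follows the argument type, and that Python 3 refuses to mix the two types in one call.

```python
import os
import tempfile


def list_both_ways():
    """List one temporary directory twice: once with a text path, once with a bytes path.

    Hypothetical helper for illustration only.
    """
    with tempfile.TemporaryDirectory() as tmp:
        # Create a single empty file so the listing is not empty.
        open(os.path.join(tmp, "sample.txt"), "w").close()
        text_names = os.listdir(tmp)                # str argument   -> list of str
        byte_names = os.listdir(os.fsencode(tmp))   # bytes argument -> list of bytes
    return text_names, byte_names


def mixing_is_rejected():
    """Return True if os.path.join refuses one str and one bytes argument."""
    try:
        os.path.join("dir", b"file.txt")
    except TypeError:
        return True
    return False


if __name__ == "__main__":
    print(list_both_ways())
    print(mixing_is_rejected())
```

As far as I know, Python 2 behaves differently when the types are mixed: it tries an implicit decode with the default (ASCII) codec, which fails with UnicodeDecodeError as soon as the byte string contains non-ASCII bytes.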
But why is the problem only partially solved? Suppose I don't use unicode at all: the script is saved as an ANSI text file and the encoding is specified explicitly (in this case, 'shift_jis'). I made another script that prints the internal representations of the strings. Let's see what I found.

################ Script ############################################
# -*- encoding: shift_jis -*-
# script5.py (saved as ANSI text file)
import os, re

def rename():
    pattern = 'パイソン\.txt'  # ANSI
    print 'pattern: ', repr(pattern)
    myre = re.compile(pattern)
    for f in os.listdir('.'):
        m = myre.match(f)
        if m is not None:
            print repr(f), ': match!'
        else:
            print repr(f), ': doesn\'t match!'

rename()

################# Output ###########################################
pattern: '\x83p\x83C\x83\\\x83\x93\\.txt'
'\x83p\x83C\x83\\\x83\x93.txt' : doesn't match!

As we can see, there is a '\\' inside the internal representation of both the pattern string and the file name (repr() shows a single backslash byte, 0x5C, as '\\'). I think this is why no match is possible: when the pattern is compiled, this byte is perceived as the start of an escape sequence rather than as what it should be, the second byte of an MBCS character. It's a bug, I think?

In the ActiveState Python 2.5 documentation I found this: 'On systems whose native character set is not ASCII, strings may use EBCDIC in their internal representation, provided the functions chr() and ord() implement a mapping between ASCII and EBCDIC, and string comparison preserves the ASCII order. Or perhaps someone can propose a better rule?'. So string matching is based on the rather messy internal representation, not on anything Unicode. I mean, strings that look the same externally (on stdin) are perceived very differently internally, and matching works on those internal representations instead of treating the strings as it should (by their external representations). BTW, Perl, on the other hand, handles strings quite well.
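The effect can be reproduced without any files on disk. The following is only a sketch under stated assumptions: it is written for Python 3 (the script in this thread is Python 2), bytes stands in for the old ANSI str, and the names STEM, NAME_BYTES, naive_match, escaped_match and text_match are invented for illustration. It compares three ways of matching the same Shift-JIS file name: a byte pattern built naively, a byte pattern whose literal part goes through re.escape(), and a text pattern applied to the decoded name.

```python
import re

STEM = "パイソン"  # file name stem taken from the script in this thread
NAME_BYTES = (STEM + ".txt").encode("shift_jis")  # contains a 0x5C trail byte


def naive_match():
    """Byte pattern built by plain concatenation.

    The 0x5C trail byte of the third character is read by the regex
    engine as a backslash that escapes the following byte.
    """
    pattern = STEM.encode("shift_jis") + rb"\.txt"
    return re.match(pattern, NAME_BYTES) is not None


def escaped_match():
    """Byte pattern whose literal part is passed through re.escape() first."""
    pattern = re.escape(STEM.encode("shift_jis")) + rb"\.txt"
    return re.match(pattern, NAME_BYTES) is not None


def text_match():
    """Decode the name and match with a text pattern instead of bytes."""
    return re.match(STEM + r"\.txt", NAME_BYTES.decode("shift_jis")) is not None


if __name__ == "__main__":
    print(repr(NAME_BYTES))
    print(naive_match(), escaped_match(), text_match())
```

If this sketch is right, only the naive byte pattern fails. That would suggest the regex engine is not confused about the internal representation as such; it simply works byte by byte and knows nothing about the encoding, so either escaping the literal bytes or matching on decoded text avoids the problem.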
Also, the Python interpreter doesn't recognize Unicode text files saved by Notepad. I hope such a feature can be added in the future, to prevent confusion and to boost performance, since we wouldn't have to convert anything to unicode (at least on Windows NT systems ;)).

Thanks again!