"Marc 'BlackJack' Rintsch" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > On Tue, 04 Mar 2008 10:49:54 +0530, Pradnyesh Sawant wrote: > >> I have a file which contains chinese characters. I just want to find out >> all the places that these chinese characters occur. >> >> The following script doesn't seem to work :( >> >> ********************************************************************** >> class RemCh(object): >> def __init__(self, fName): >> self.pattern = re.compile(r'[\u2F00-\u2FDF]+') >> fp = open(fName, 'r') >> content = fp.read() >> s = re.search('[\u2F00-\u2fdf]', content, re.U) >> if s: >> print s.group(0) >> if __name__ == '__main__': >> rc = RemCh('/home/pradnyesh/removeChinese/delFolder.php') >> ********************************************************************** >> >> the php file content is something like the following: >> >> ********************************************************************** >> // Check if the folder still has subscribed blogs >> $subCount = function1($param1, $param2); >> if ($subCount > 0) { >> $errors['summary'] = 'æÂï½ æ½å¤æ¤Ã¥Ã¯Â«Ã¥Ã©Ã©Â§Ã§Â²Ã¨'; >> $errorMessage = 'æÂï½ æ½å¤æ¤Ã¥Ã¯Â«Ã¥Ã©Ã©Â§Ã§Â²Ã¨'; >> } > > Looks like an UTF-8 encoded file viewed as ISO-8859-1. Sou you should > decode `content` to unicode before searching the chinese characters. >
I couldn't get your data to decode into anything resembling Chinese, so I
created my own file as an example. When you read an encoded text file, it
comes in as just a bunch of bytes:

>>> print open('chinese.txt','r').read()
我是美国人。 WÇ’ shì MÄ›iguórén. I am an American.

Garbage, because the encoding isn't known. Provide the correct encoding and
decode it to Unicode:

>>> print open('chinese.txt','r').read().decode('utf8')
我是美国人。 Wǒ shì Měiguórén. I am an American.

Here's the Unicode string. Note the 'u' before the quotes to indicate
Unicode:

>>> s=open('chinese.txt','r').read().decode('utf8')
>>> s
u'\ufeff\u6211\u662f\u7f8e\u56fd\u4eba\u3002 W\u01d2 sh\xec M\u011bigu\xf3r\xe9n. I am an American.'

If you are working with Unicode strings, the re module should be given
Unicode patterns as well:

>>> import re
>>> print re.search(ur'[\u4E00-\u9FA5]',s).group(0)
我
>>> print re.findall(ur'[\u4E00-\u9FA5]',s)
[u'\u6211', u'\u662f', u'\u7f8e', u'\u56fd', u'\u4eba']

Hope that helps you.

--Mark
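To cover the original question (finding every place the Chinese characters occur), the steps in the session above can be combined into one small script. This is a rough sketch rather than a finished tool: `find_chinese`, the sample file content, and the choice of the `utf-8-sig` codec are assumptions for illustration, and the character range is simply the one used in the session. It targets Python 3; `io.open` and the `u''` literals mean it should also run on Python 2.6+.

```python
import io
import os
import re
import tempfile

# Same character range as in the session above, matching whole runs.
CHINESE_RUN = re.compile(u'[\u4e00-\u9fa5]+')


def find_chinese(path, encoding='utf-8-sig'):
    """Return (line number, column, text) for each run of Chinese characters.

    'utf-8-sig' decodes UTF-8 and drops a leading byte order mark if present.
    """
    hits = []
    with io.open(path, 'r', encoding=encoding) as fp:
        for line_no, line in enumerate(fp, 1):
            for match in CHINESE_RUN.finditer(line):
                hits.append((line_no, match.start(), match.group(0)))
    return hits


# Build a small sample file (hypothetical content) that starts with a BOM.
sample = u'\ufeff// comment\n$msg = \'\u6211\u662f\u7f8e\u56fd\u4eba\u3002\';\n'
handle, sample_path = tempfile.mkstemp(suffix='.php')
os.close(handle)
with io.open(sample_path, 'w', encoding='utf-8') as fp:
    fp.write(sample)

results = find_chinese(sample_path)
os.remove(sample_path)

for line_no, column, text in results:
    # Print escaped code points so the output is safe on any terminal.
    escaped = text.encode('unicode_escape').decode('ascii')
    print('%d:%d %s' % (line_no, column, escaped))
```

For the sample file this reports a single run on line 2 starting at column 8. The full stop U+3002 is not reported because it lies outside the range; widen the character class if punctuation should count too. Decoding with `utf-8-sig` instead of `utf8` avoids the stray `\ufeff` that shows up at the start of the string in the session.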