On Tue, 2008-05-20 at 08:28 -0700, Gary Herron wrote:
> A_H wrote:
> > Help!
> >
> > I've scraped a PDF file for text and all the minus signs come back as
> > u'\xad'.
> >
> > Is there any easy way I can change them all to plain old ASCII '-' ???
> >
> > str.replace complained about a missing codec.
> >
> > Hints?
> >
>
> Encoding it into a 'latin1' encoded string seems to work:
>
>   >>> print u'\xad'.encode('latin1')
>   -
>

Here's what I've found:

>>> x = u'\xad'
>>> x.replace('\xad','-')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xad in position 0: ordinal not in range(128)
>>> x.replace(u'\xad','-')
u'-'

If you replace the *string* '\xad' in the first argument to replace with
the *unicode object* u'\xad', python won't complain anymore.  (Mind you,
you weren't using str.replace.  You were using unicode.replace.  Slight
difference, but important.)

If you do the replace on a plain string, it doesn't have to convert
anything, so you don't get a UnicodeDecodeError.

>>> x = x.encode('latin1')
>>> x
'\xad'
>>> # Note the lack of a u before the ' above.
>>> x.replace('\xad','-')
'-'
>>>

Cheers,
Cliff

--
http://mail.python.org/mailman/listinfo/python-list
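
For anyone reading this on Python 3 (an assumption; the session above is Python 2), the str/unicode mismatch does not come up, because every str is already a Unicode string. A minimal sketch of the same replacement follows. The helper name fix_minus_signs and the sample text are made up for illustration and are not from the original post. U+00AD is the Unicode SOFT HYPHEN, which is the character the scrape returned.

```python
# Sketch only: assumes Python 3, where str is always Unicode.
SOFT_HYPHEN = "\xad"  # U+00AD SOFT HYPHEN, what the PDF scrape produced


def fix_minus_signs(text):
    """Return text with every U+00AD replaced by an ASCII '-'."""
    return text.replace(SOFT_HYPHEN, "-")


# Hypothetical sample input, not taken from the original PDF.
scraped = "\xad12.5 and \xad3"
cleaned = fix_minus_signs(scraped)

# After the replacement the sample is pure ASCII, so this does not raise.
ascii_bytes = cleaned.encode("ascii")
print(cleaned)
```

If more than one character needs mapping, str.translate with a table built by str.maketrans would do the same job in one pass.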