Damon Lynch wrote:
Hi,

I need to convert a lot of text sent in e-mails and MS Word documents into plain text format, to be fed into a python script and then e-mailed. I want the final product to be plain ASCII text, i.e. no fancy em hyphens, curly quotes and so forth.

One big problem currently is that when I copy-n-paste characters like curly quotes or em hyphens from OpenOffice.org into gedit or kate, the characters show up obviously incorrect, e.g. a capital A with a bar on top. When looking at them in python strings, these are some examples:

\xe2\x80\x99 (single quote)
\xe2\x80[\x9c\x9d] (RE of opening and closing double quote)
\x93 (another curly quote)

Is there a utility program in Linux to convert these characters? Or is there a library in Python that will do it for me (instead of me using RE's to substitute them)?

Many thanks,
Damon

I don't mean to be facetious here, but the utility is called Perl. You should be able to read the file a line at a time and filter each line through a regex that matches the hex character data, so that only normal ASCII characters come out the other end.
Mark
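
For what it's worth, the same line-at-a-time regex filter can also be sketched in Python, since that is what the text is being fed into anyway. This is only a rough illustration, not a tested tool: the names to_ascii, convert_file, REPLACEMENTS and NON_ASCII are made up for the example, the replacement table is an assumption about which characters matter, and the input is assumed to be UTF-8 (a lone \x93 suggests some of the text is really Windows-1252, which would need a different encoding argument).

```python
import re

# Assumed replacement table: common "smart" punctuation mapped to ASCII.
# Extend it with whatever else turns up in the real input.
REPLACEMENTS = {
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote (UTF-8 bytes \xe2\x80\x99)
    "\u201c": '"',    # left double quote  (UTF-8 bytes \xe2\x80\x9c)
    "\u201d": '"',    # right double quote (UTF-8 bytes \xe2\x80\x9d)
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # ellipsis
}

# Matches any character outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7f]")


def to_ascii(line):
    """Swap known fancy characters for ASCII, then drop whatever is left."""
    for fancy, plain in REPLACEMENTS.items():
        line = line.replace(fancy, plain)
    return NON_ASCII.sub("", line)


def convert_file(src_path, dst_path, encoding="utf-8"):
    """Read src_path a line at a time and write the filtered text to dst_path."""
    with open(src_path, encoding=encoding) as src, \
            open(dst_path, "w", encoding="ascii") as dst:
        for line in src:
            dst.write(to_ascii(line))
```

Something like convert_file("in.txt", "out.txt") would then do the whole file; passing encoding="cp1252" is the thing to try if the source turns out to be Windows text. Note that anything not in the table is silently dropped, so accented letters vanish rather than being transliterated.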
