On Fri, 10 Jan 2003 15:27, Damon Lynch wrote:
> Hi,
>
> I need to convert a lot of text sent in e-mails and MS Word documents
> into plain text format, to be fed into a python script and then
> e-mailed. I want the final product to be plain ASCII text i.e. no fancy
> em hyphens, curly quotes and so forth.
>
> One big problem currently is that when I copy-n-paste characters like
> curly quotes or em hyphens from OpenOffice.org into gedit or kate, the
> characters show up obviously incorrect. e.g. a capital A with a bar on
> top. When looking at them in python strings, these are some examples:
> \xe2\x80\x99 (single quote)
> \xe2\x80[\x9c\x9d] (RE of opening and closing double quote)
> \x93 (another curly quote)
>
> Is there a utility program in Linux to convert these characters? Or is
> there a library in Python that will do it for me (instead of me using
> RE's to substitute them)?
>
> Many thanks,
> Damon

You may be looking for something like demoroniser:

http://www.fourmilab.ch/webtools/demoroniser/

It is designed specifically for HTML pages, but if you have any experience
with Perl you could hack it to work on the text files saved from
OpenOffice.org.

--
Michael
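
For the Python side of the question, a minimal sketch of the same idea follows. It is not demoroniser and not a library that was available at the time of this thread; it assumes a modern Python 3 interpreter, and the names PUNCTUATION and to_ascii are made up for illustration. The byte sequences quoted above suggest two encodings are in play: \xe2\x80\x99 is the UTF-8 form of a right single quote, while a lone \x93 is a Windows-1252 left double quote, so the sketch tries UTF-8 first and falls back to cp1252. The replacement table is deliberately small and would need extending for real input.

```python
import unicodedata

# Typographic characters mapped to plain ASCII stand-ins.
# Illustrative and incomplete: extend it for the input you actually see.
PUNCTUATION = {
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # ellipsis
    "\u00a0": " ",    # non-breaking space
}
TABLE = str.maketrans(PUNCTUATION)


def to_ascii(raw: bytes) -> str:
    """Decode raw bytes and reduce the text to plain ASCII (hypothetical helper)."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 came from Windows-1252.
        text = raw.decode("cp1252", errors="replace")
    # Swap known typographic punctuation for ASCII equivalents.
    text = text.translate(TABLE)
    # Split accented letters into base letter + combining mark,
    # then drop whatever still is not ASCII.
    text = unicodedata.normalize("NFKD", text)
    return text.encode("ascii", "ignore").decode("ascii")


if __name__ == "__main__":
    print(to_ascii(b"it\xe2\x80\x99s"))   # UTF-8 curly apostrophe
    print(to_ascii(b"\x93quoted\x94"))    # Windows-1252 curly quotes
```

Note that the final encode step silently discards any character with no mapping and no ASCII decomposition, so it is worth checking the output for lost symbols before mailing it on.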
