Damon Lynch wrote:
Hi,

I need to convert a lot of text sent in e-mails and MS Word documents
into plain text format, to be fed into a python script and then
e-mailed.  I want the final product to be plain ASCII text i.e. no fancy
em hyphens, curly quotes and so forth.

One big problem currently is that when I copy-n-paste characters like
curly quotes or em hyphens from OpenOffice.org into gedit or kate, the
characters show up obviously incorrect, e.g. as a capital A with a bar
on top.  When looking at them in Python strings, these are some examples:
\xe2\x80\x99 (single quote)
\xe2\x80[\x9c\x9d] (RE of opening and closing double quote)
\x93 (another curly quote)

Is there a utility program in Linux to convert these characters?  Or is
there a library in Python that will do it for me (instead of me using
RE's to substitute them)?

Many thanks,
Damon
Damon,

I don't mean to be facetious here, but the utility is called Perl. You should be able to read the file a line at a time and filter each line through a regex that matches the hex character data, so that only normal ASCII characters come out the other end.
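
Since the text is headed into a Python script anyway, the same line-at-a-time filter can be sketched there as well. What follows is only a rough illustration, not a tested utility. It assumes Python 3, and it assumes the input is a mix of UTF-8 (\xe2\x80\x99 is the UTF-8 encoding of a right single quote) and Windows-1252 (\x93 is an opening double quote in that code page). The names REPLACEMENTS, decode_bytes and to_ascii are made up for this example, and the replacement table is deliberately short.

```python
import re

# Illustrative, non-exhaustive table of "fancy" punctuation and the
# plain ASCII text to use instead. Extend it for your own documents.
REPLACEMENTS = {
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # ellipsis
}

# Matches any character outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7f]")


def decode_bytes(raw):
    """Decode raw bytes as UTF-8, falling back to Windows-1252.

    Assumption: anything that is not valid UTF-8 came from a
    Windows-1252 source such as an MS Word document.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("cp1252", errors="replace")


def to_ascii(line):
    """Swap known fancy punctuation for ASCII, then drop the rest."""
    for fancy, plain in REPLACEMENTS.items():
        line = line.replace(fancy, plain)
    return NON_ASCII.sub("", line)


def convert_file(src_path, dst_path):
    """Read src_path a line at a time and write plain ASCII to dst_path."""
    with open(src_path, "rb") as src, open(dst_path, "w", encoding="ascii") as dst:
        for raw_line in src:
            dst.write(to_ascii(decode_bytes(raw_line)))
```

Mapping the known characters first keeps quotes and dashes readable. The final regex pass is the blunt part: anything not in the table is simply deleted, so accented letters would be lost unless they are added to the table.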

Mark


