Re: Unicode to ASCII

Bob Proulx Fri, 19 Feb 2021 19:08:40 -0800

Todd Gruhn wrote:
> I extracted the "text" from a large PDF using a NetBSD prog called
> pdftotext(1).


pdftotext is really awesome.  I find "pdftotext -layout" to do a truly
excellent job with most PDF files I need to deal with from banks and
things here.

> I got the desired ASCII text, but it has many occurrences of the sequence
> \x{80}\x{9c} ... \x{80}\x{9d}

Do you know what charset that is in natively?
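
One way to answer that is to look at the raw bytes.  Below is a minimal
sketch, assuming a POSIX shell with od and tr; the sample string is made
up for illustration and stands in for a line of the extracted file.  If
the dump shows e2 80 9c ... e2 80 9d, that is the UTF-8 encoding of the
curly double quotes U+201C and U+201D, and the \x{80}\x{9c} and
\x{80}\x{9d} above would match the last two bytes of each.

```shell
# Hypothetical sample: a word wrapped in UTF-8 curly quotes (U+201C, U+201D),
# written as octal escapes so the script itself stays pure ASCII.
sample=$(printf '\342\200\234hi\342\200\235')

# Dump the bytes as hex and strip od's spacing and newlines so the
# result is one compact hex string.
hex=$(printf '%s' "$sample" | od -An -tx1 | tr -d ' \n')
echo "$hex"    # prints e2809c6869e2809d
```

On the real file, something like "od -An -tx1 <filein | head" should show
enough of the byte pattern to guess the encoding.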

> Is there a nice and universal utility that can convert these to ASCII chars?
> Someone mentioned EMACS... What about in pkgsrc?

I'll be honest and say I did not look, but on another system I
routinely use "iconv" for this type of thing.  I will cross my fingers
and hope it is available in pkgsrc.

    iconv -f UTF-8 -t ASCII//TRANSLIT <filein >fileout

That assumes UTF-8 in and ASCII out.  You will probably need a
different code set, such as this one, or some other code page.

    iconv -f CP1252 -t UTF-8 <filein >fileout
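
In case the local iconv does not support the //TRANSLIT suffix (not
every implementation does), a cruder fallback is to replace just the
offending byte sequences with sed.  This is only a sketch: it assumes
the input is UTF-8, it handles only the two curly double quotes, and the
to_ascii_quotes name is made up for this example.

```shell
# Build the UTF-8 byte sequences for U+201C and U+201D with octal escapes,
# since \x escapes inside sed patterns are not portable.
lq=$(printf '\342\200\234')
rq=$(printf '\342\200\235')

# Hypothetical helper: filter stdin to stdout, replacing curly double quotes
# with plain ASCII ones.  LC_ALL=C makes sed treat the input as raw bytes.
to_ascii_quotes() {
    LC_ALL=C sed -e "s/$lq/\"/g" -e "s/$rq/\"/g"
}

# Small demonstration on a made-up line of text.
printf '%s\n' "${lq}hello${rq}" | to_ascii_quotes    # prints "hello"
```

With the function defined, "to_ascii_quotes <filein >fileout" would
process a whole file.  Other non-ASCII characters pass through
untouched, so each would need its own substitution.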

Even if this is incomplete, I hope it is still useful.

Bob
