Re: [Dorset] Manipulating PDF Files in Linux

Terry Coles Fri, 18 Jul 2014 08:12:06 -0700

On Friday 18 Jul 2014 15:35:58 Terry Coles wrote:
> On Friday 18 Jul 2014 15:02:50 Andrew Montgomery-Hurrell wrote:
> > You can try pdftohtml[1] to get it into HTML format, from there it should
> > be easier to convert into a document format you want using something like
> > pandoc[2].
> > 
> > [1]: http://pdftohtml.sourceforge.net/
> > [2]: http://johnmacfarlane.net/pandoc/README.html
> 
> That was incredibly fast :-)
> 
> Unfortunately, for some reason the conversion inserted lots of unprintable
> chars, e.g.:
> 
>       PRÃ‰AMBULE
> 
> pdftotext did a better job of accurately converting the text, but lost all
> the formatting :-(


What worked was:

        pdftohtml -c -s <pdffile>

It did a lovely job of retaining both the formatting and the text.  When I get
to work on Monday, I'll see if the translation company can work with the HTML.
If not, I'll see what happens if I open the HTML in Word (LibreOffice
demolished it).

-- 
        
        Terry Coles

        

-- 
Next meeting:  Bournemouth, Tuesday, 2014-08-05 20:00
Meets, Mailing list, IRC, LinkedIn, ...  http://dorset.lug.org.uk/
New thread on mailing list:  mailto:[email protected]
How to Report Bugs Effectively:  http://goo.gl/4Xue
