On Friday 18 Jul 2014 16:35:11 John Palmer wrote:
> Terry, I've used pdftotext to good effect.
> pdftotext -layout will do a lot of what you want, but tabular matter is
> always a difficulty because it isn't a simple sequence of words.

Yes. Tables and images were a problem when I tried pdftotext.

> You say 'because [the documents] need translating'; i.e. to another
> language or languages? or have I misunderstood?

Yes. From French.

> Are there many illustrations and of what character? Are they needed in
> the translated version(s) and if so will they need altering?

Yes. But I'm more concerned with getting at the text inside the Tables
without too much re-interpretation of what is what.

> There may be some mileage (and also some work) in converting the
> documents to a notation in which the logical structure (rather than the
> actual layout on the page) is indicated by mark-up. I'm thinking mainly
> of LaTeX and its friends, though [X]HTML (used properly) has this
> character too. Then all you have to do is to swap the text of each
> English paragraph or other unit of text (caption, for instance) for a
> Spanish (etc) text and the final layout adjusts itself to fit.
> If the target language is right-to-left or ideographic it's harder but
> still within the scope of TeX.

Yes. As mentioned in my earlier post, I was able to get pdftohtml to do an
excellent job.

--
Terry Coles

--
Next meeting: Bournemouth, Tuesday, 2014-08-05 20:00
Meets, Mailing list, IRC, LinkedIn, ... http://dorset.lug.org.uk/
New thread on mailing list: mailto:dorset@mailman.lug.org.uk
How to Report Bugs Effectively: http://goo.gl/4Xue