On Mon, 26 Jan 2009 23:39:06 +0100, cpghost <cpgh...@cordula.ws> wrote:
> Those PDFs are usually scanned,
> and the scanner software (usually on Windows) assembles all screenshots
> into a PDF of images.

Handy for printing, but not for OCR postprocessing.

> That's what you find on the Net.

On the Web. :-)

> This is not such a bad idea, esp. when it comes to technical textbooks,
> which usually contain a lot of diagrams, formulae, tables etc...; since
> an OCR software that would be able to reverse all this into LaTeX and
> EPS figures has yet to be programmed (that's a difficult task).

As I've already mentioned, recognizing the characters is only one part.
Your example of diagrams and formulas illustrates this well.
And because LaTeX is the only professional typesetting system
(and no, "Word" isn't such a tool), it would be really great to
have a pdf2tex tool that extracts the characters of the text,
typesets them as in the original (paragraphing, hyphenation etc.),
includes embedded pictures as pictures (of course), and re-creates
formulas, so that the result would run through pdfLaTeX and
produce an improved version of the source PDF file.

But that's a task for the next generation of mankind. :-)

> Some PDFs encode the fonts
> in a special section, and then use text (sometimes compressed
> or encrypted), which refers to those fonts. In such a case, you
> could extract the pure text from the PDF.

It's worth mentioning that if the original text contains characters
(represented in the additionally stored fonts) with special
accents or orientations (usually in non-English languages), the
target system needs to support them, which it usually does by
means of UTF-8.
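
To illustrate the accent point, here is a minimal Python sketch. It
assumes (this is my assumption, not something every extractor does) that
the text comes out of the PDF with the accent stored as a separate
combining character; the sample string is made up. Normalizing to NFC
and encoding as UTF-8 gives something the target system can handle:

```python
import unicodedata

# Hypothetical extractor output: the umlaut is a separate combining
# character (U+0308) following a plain "U".
extracted = "U\u0308berfraktur"

# NFC normalization merges base letter and combining accent into the
# single precomposed code point U+00DC.
normalized = unicodedata.normalize("NFC", extracted)

# In UTF-8 the precomposed letter takes two bytes (0xC3 0x9C).
encoded = normalized.encode("utf-8")

print(len(extracted), len(normalized))
print(encoded)
```

Whether normalization is needed at all depends on how the text was
stored; if the extractor already delivers precomposed characters, the
call simply leaves the string unchanged.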

> Other PDFs simply encode the book as a set of bitmaps (see above);
> and then your only chance is to find an OCR software that would not
> only be able to recognize the characters in the bitmaps, but also
> to cope with those Fraktur- or other exotic fonts.

Yes, das Doytsh Uberfrucktoor makes everything unreadable. :-)

It gets even more complicated with hand-written books...

> Some OCR programs
> are interactive and trainable, so that you can say: this is an 'S',
> and that is a 'T'..., but AFAIK, there's no free and open source
> OCR program with this capability (yet).

Wow, I had never heard of this concept, but it's a really intelligent
solution. If this really works, it still has the "disadvantage" of
needing much time for training the program, and for postprocessing.

It's easier to \usepackage[german]{uberfraktur} to make the text
unreadable again. :-)
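
For what it's worth, the "this is an 'S', and that is a 'T'" idea can be
sketched as a toy in a few lines of Python. This is only my guess at the
principle, not how any real OCR program works: labelled glyph bitmaps
are stored, and an unknown glyph gets the label of the stored example
that differs from it in the fewest pixels. The class name and the 5x5
bitmaps are invented for the example.

```python
def distance(a, b):
    """Count differing pixels between two equally sized bitmaps."""
    return sum(pa != pb
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b))


class GlyphTrainer:
    """Toy trainable recognizer (hypothetical, nearest-neighbour only)."""

    def __init__(self):
        self.examples = []  # list of (bitmap, label) pairs

    def teach(self, bitmap, label):
        """The interactive step: the user says which letter this is."""
        self.examples.append((bitmap, label))

    def recognize(self, bitmap):
        """Return the label of the closest taught example."""
        if not self.examples:
            raise ValueError("no training examples yet")
        _, label = min(self.examples,
                       key=lambda ex: distance(ex[0], bitmap))
        return label


# 5x5 bitmaps, '#' is ink and '.' is paper.
S = ["#####",
     "#....",
     "#####",
     "....#",
     "#####"]
T = ["#####",
     "..#..",
     "..#..",
     "..#..",
     "..#.."]

trainer = GlyphTrainer()
trainer.teach(S, "S")
trainer.teach(T, "T")

# A slightly damaged 'T', as a scan might deliver it.
noisy_T = ["####.",
           "..#..",
           "..#..",
           ".##..",
           "..#.."]
print(trainer.recognize(noisy_T))
```

The time-consuming part is exactly what the sketch hides: somebody has
to call teach() for every glyph shape in the book, and the result still
needs proofreading afterwards.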

From Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...
freebsd-questions@freebsd.org mailing list