Re: Abandoning work on PDF for ocropus

Thomas Breuel Wed, 05 Aug 2009 03:40:47 -0700

Just as a reminder: for PDF import or PDF text extraction, Poppler
doesn't need to be linked with OCRopus.  Both of those are operations
that are book-level and can be carried out on the book-level
representation.


Basically, for PDF input two commands are needed:

render-pdf-to-book-pages book/ input.pdf

Generates pages rendered at 300 dpi from input.pdf, in OCRopus book format.

extract-pdf-text-for-layout book/ input.pdf

Looks through the page segmentation files (book/0001.pseg.png), finds
the characters in input.pdf that overlap text line bounding boxes, and
outputs them in left-to-right (or right-to-left) order into the
corresponding line text files (book/0001/010001.txt).
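
To make the matching step concrete, here is a minimal sketch in Python of
how characters might be assigned to text lines.  It is only an illustration,
not an existing OCRopus tool: the names Char, overlap_area and
text_for_lines are hypothetical, the boxes are assumed to be already decoded
from the .pseg.png file and scaled to the same coordinates as the PDF
characters, and picking the line with the largest overlap is an assumption
on top of the plain "overlap" rule described above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# (x0, y0, x1, y1); PDF characters and line boxes are assumed to share
# one coordinate system (e.g. pixels of the 300 dpi rendering).
Box = Tuple[float, float, float, float]


@dataclass
class Char:
    """Hypothetical record for one character extracted from the PDF."""
    text: str
    box: Box


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two boxes, 0.0 if they do not intersect."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return width * height if width > 0 and height > 0 else 0.0


def text_for_lines(chars: List[Char], lines: Dict[str, Box],
                   rtl: bool = False) -> Dict[str, str]:
    """Give each character to the line box it overlaps most, then sort by x."""
    buckets: Dict[str, List[Char]] = {line_id: [] for line_id in lines}
    for ch in chars:
        best_id: Optional[str] = None
        best_area = 0.0
        for line_id, box in lines.items():
            area = overlap_area(ch.box, box)
            if area > best_area:
                best_id, best_area = line_id, area
        if best_id is not None:  # characters outside every line are dropped
            buckets[best_id].append(ch)
    result: Dict[str, str] = {}
    for line_id, members in buckets.items():
        # Left-to-right by default; reverse the x order for rtl scripts.
        members.sort(key=lambda c: c.box[0], reverse=rtl)
        result[line_id] = "".join(c.text for c in members)
    return result
```

Each value of the returned dictionary would then be written to the text
file of the corresponding line.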

With these two commands, we can handle PDF files that contain images
only, as well as digitally generated PDF files.  For the latter, the
sequence of commands would be:

render-pdf-to-book-pages book/ input.pdf
ocropus pages2lines book/
extract-pdf-text-for-layout book/ input.pdf
ocropus buildhtml book/ > output.html

Afterwards, you get reflowable HTML in reading order corresponding to
input.pdf.  Or you can generate tagged PDF once we have tagged PDF
generation implemented.
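
For illustration, the four-step sequence for digitally generated PDFs could
be driven from a small Python script.  This is a sketch under assumptions:
render-pdf-to-book-pages and extract-pdf-text-for-layout are the commands
proposed in this message and may not exist as executables, and the function
names pdf_to_html_commands and run_pipeline are hypothetical.

```python
import subprocess
from typing import List


def pdf_to_html_commands(book_dir: str, pdf_path: str) -> List[List[str]]:
    """Return the command sequence from this message as argv lists."""
    return [
        # Proposed command; assumed to be on PATH.
        ["render-pdf-to-book-pages", book_dir, pdf_path],
        ["ocropus", "pages2lines", book_dir],
        # Proposed command; assumed to be on PATH.
        ["extract-pdf-text-for-layout", book_dir, pdf_path],
        ["ocropus", "buildhtml", book_dir],
    ]


def run_pipeline(book_dir: str, pdf_path: str, html_path: str) -> None:
    """Run each step in order, stopping at the first failure."""
    commands = pdf_to_html_commands(book_dir, pdf_path)
    for argv in commands[:-1]:
        subprocess.run(argv, check=True)
    # The last step writes HTML to stdout, so redirect it into a file.
    with open(html_path, "w") as out:
        subprocess.run(commands[-1], stdout=out, check=True)
```

Calling run_pipeline("book/", "input.pdf", "output.html") would mirror the
shell sequence above.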

So, again, there is no need to link a PDF library into OCRopus for PDF
input support, and the license on the PDF library shouldn't make a
difference.

Tom
