Just as a reminder: for PDF import or PDF text extraction, Poppler doesn't need to be linked with OCRopus. Both are book-level operations and can be carried out on the book-level representation.
Basically, for PDF input two commands are needed:

    render-pdf-to-book-pages book/ input.pdf

Generates 300 dpi rendered pages from input.pdf in OCRopus book format.

    extract-pdf-text-for-layout book/ input.pdf

Looks through the page segmentation files (book/0001.pseg.png), finds the characters in input.pdf that overlap text line bounding boxes, and outputs them in left-to-right (or right-to-left) order into the corresponding line text files (book/0001/010001.png).

With these two commands, we can handle PDF files that contain images only, as well as digitally generated PDF files. For the latter, the sequence of commands would be:

    render-pdf-to-book-pages book/ input.pdf
    ocropus pages2lines book/
    extract-pdf-text-for-layout book/ input.pdf
    ocropus buildhtml book/ > output.html

Afterwards, you get reflowable HTML in reading order corresponding to input.pdf. Or you can generate tagged PDF once we have tagged PDF generation implemented.

So, again, there is no need to link with OCRopus for PDF input support, and the license on the PDF library shouldn't make a difference.

Tom
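To make the character-to-line matching in extract-pdf-text-for-layout concrete, here is a minimal Python sketch of that one step. It is only an illustration, not the actual tool: the names Box, overlap_area, and assign_chars_to_lines are hypothetical, and it assumes the PDF character boxes have already been scaled into the same pixel coordinates as the 300 dpi page images (PDF points times 300/72). Each character goes to the line box it overlaps most, and characters within a line are ordered by their left edge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in page pixel coordinates (hypothetical helper)."""
    x0: float
    y0: float
    x1: float
    y1: float


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two boxes, or 0.0 if they do not overlap."""
    w = min(a.x1, b.x1) - max(a.x0, b.x0)
    h = min(a.y1, b.y1) - max(a.y0, b.y0)
    return w * h if w > 0 and h > 0 else 0.0


def assign_chars_to_lines(chars, lines, rtl=False):
    """Group PDF characters by the text line box they overlap most.

    chars: list of (text, Box) pairs taken from the PDF.
    lines: list of Box, one per text line from the page segmentation.
    Returns one string per line; characters overlapping no line are dropped.
    """
    buckets = [[] for _ in lines]
    for text, box in chars:
        best_index, best_area = None, 0.0
        for i, line in enumerate(lines):
            area = overlap_area(box, line)
            if area > best_area:
                best_index, best_area = i, area
        if best_index is not None:
            buckets[best_index].append((box.x0, text))

    result = []
    for bucket in buckets:
        # Order by left edge; reverse the order for right-to-left scripts.
        bucket.sort(key=lambda item: item[0], reverse=rtl)
        result.append("".join(text for _, text in bucket))
    return result
```

A real implementation would also have to read the line boxes out of the pseg image and insert spaces between words; both are left out here to keep the sketch short.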
