hOCR to PDF with Python

Jonathan Brinley Mon, 06 Apr 2009 06:34:36 -0700

Building off of Florian Hackenberger's Java-based converter
(http://groups.google.com/group/ocropus/browse_thread/thread/3cf464bda5807952),
I've built a small Python script to convert hOCR documents to PDF. See
http://xplus3.net/2009/04/02/convert-hocr-to-pdf/#more-207 for info and
to download.


It can either be called from the command line:

$ python HocrConverter.py myHocrFile.html myImageFile.png output.pdf

or imported into a Python script:

from HocrConverter import HocrConverter
hocr = HocrConverter("myHocrFile.html")
hocr.to_text("output.txt")
hocr.to_pdf("myImageFile.png", "output.pdf")

The main differences between this script and Mr. Hackenberger's
script:
1. This stretches each line of text horizontally to fill its bounding
box.
2. This requires you to specify the image to use, rather than using the
image indicated in the hOCR file (in case you want to use a
different-resolution image for the PDF).
3. This can output either PDF or plain text.
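
To illustrate the first point, here is a minimal sketch of how a horizontal stretch factor could be derived from an hOCR bounding box. It is not taken from HocrConverter.py: the helper names parse_bbox and horizontal_scale, the natural_width parameter (the unscaled width of the line in points, which a PDF library's font metrics would supply), and the 300 dpi default are all assumptions made for the example. The only thing it relies on from hOCR itself is that an element's box appears in its title attribute as "bbox x0 y0 x1 y1" in pixels.

```python
import re

# hOCR records an element's box in its title attribute, e.g. "bbox 100 200 400 230".
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


def parse_bbox(title):
    """Return (x0, y0, x1, y1) in pixels from an hOCR title string, or None."""
    match = BBOX_RE.search(title)
    if match is None:
        return None
    return tuple(int(value) for value in match.groups())


def horizontal_scale(natural_width, bbox, dpi=300):
    """Return the percentage by which to stretch a line to span its box.

    natural_width is the unscaled width of the text in points (assumed to
    come from the PDF library's font metrics); dpi is the assumed
    resolution of the page image, used to convert pixels to points.
    """
    x0, _, x1, _ = bbox
    box_width_pt = (x1 - x0) * 72.0 / dpi  # 72 points per inch
    if natural_width <= 0:
        return 100.0  # nothing to measure, so leave the text unscaled
    return 100.0 * box_width_pt / natural_width


if __name__ == "__main__":
    box = parse_bbox("bbox 100 200 400 230")
    print(box, horizontal_scale(36.0, box))
```

Passing the dpi explicitly also fits the second point: if a different-resolution image is used for the PDF, the pixel-to-point conversion has to change with it.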

Please let me know how it works for you. I'd welcome any suggestions
or contributions.

Have a nice day,
Jonathan



--
Jonathan M. Brinley

[email protected]
http://xplus3.net/

