Yep, that would be great! The hOCR format looks really good as a source for building text layers. I'm trying to use OCRopus with PDFBox extraction (plus JAI for better image quality) in order to run text searches on non-searchable PDFs. The results are interesting. I first tried rendering a whole page to a single image, producing one complete hOCR file, and converting the coordinates to PDF conventions. I get better results by extracting each image from a page and building a separate hOCR file per image (I'm working on press articles, and the layout detection is better this way). I run the search on the hOCR file, then translate the coordinates using the image's position and size on the page. I wish I could train the Tesseract engine better and more easily; I bet I'd get better results!
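The coordinate translation described above could look roughly like this. It is only a minimal Python sketch, not the PDFBox code itself. It assumes hOCR bounding boxes are given in image pixels with the origin at the top-left, that PDF user space has its origin at the bottom-left, and that the image is placed on the page without rotation or skew. The names `Rect`, `parse_bbox` and `hocr_to_pdf` are made up for illustration.

```python
import re
from typing import NamedTuple, Tuple


class Rect(NamedTuple):
    """Axis-aligned rectangle given as (x0, y0, x1, y1)."""
    x0: float
    y0: float
    x1: float
    y1: float


# hOCR stores the box in the title attribute, e.g. title="bbox 100 200 300 400"
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


def parse_bbox(title: str) -> Rect:
    """Extract the pixel bounding box from an hOCR title attribute."""
    match = BBOX_RE.search(title)
    if match is None:
        raise ValueError(f"no bbox found in {title!r}")
    return Rect(*(float(value) for value in match.groups()))


def hocr_to_pdf(bbox: Rect, image_px: Tuple[int, int], placement: Rect) -> Rect:
    """Map a pixel box (top-left origin) into PDF units (bottom-left origin).

    image_px  -- (width, height) of the extracted image in pixels
    placement -- where the image sits on the page in PDF units:
                 (x0, y0) is its lower-left corner, (x1, y1) its upper-right
    Assumption: the image is not rotated or skewed on the page.
    """
    px_w, px_h = image_px
    scale_x = (placement.x1 - placement.x0) / px_w
    scale_y = (placement.y1 - placement.y0) / px_h
    return Rect(
        placement.x0 + bbox.x0 * scale_x,
        # Flip the y axis: the bottom edge of the pixel box (bbox.y1)
        # becomes the lower y value in PDF space.
        placement.y0 + (px_h - bbox.y1) * scale_y,
        placement.x0 + bbox.x1 * scale_x,
        placement.y0 + (px_h - bbox.y0) * scale_y,
    )
```

For example, with a 1000x2000 pixel image placed at `Rect(100, 200, 350, 700)` on the page, `hocr_to_pdf(parse_bbox("bbox 100 200 300 400"), (1000, 2000), Rect(100, 200, 350, 700))` gives `Rect(125.0, 600.0, 175.0, 650.0)`.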
On Dec 5, 6:06 am, farmer <[EMAIL PROTECTED]> wrote:
> Hi,
> Are there any plans to implement the ability to produce
> text-searchable PDFs from image files?
>
> Thanks,
> Farmer
