Re: pdf capabilities

squiggly Fri, 05 Dec 2008 02:57:04 -0800

Yep, that would be great!
The hOCR format looks really good as a source for building text
layers.
I'm trying to use ocropus together with PDFBox for image extraction
(plus JAI for better image quality) in order to run text searches on
non-searchable PDFs.
The results are interesting. I first tried rendering a whole page to
an image, producing a single hOCR file for it, and converting the
coordinates to PDF conventions.
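
For illustration, the whole-page conversion could look roughly like the sketch below. It assumes the hOCR bbox is in pixels with the origin at the top-left of the rendered image, that the page was rendered at a known DPI, and that the PDF page uses the default user space (1/72 inch units, origin at the bottom-left, no rotation or MediaBox offset). The function name and parameters are hypothetical, not part of ocropus or PDFBox.

```python
def hocr_bbox_to_pdf(bbox, dpi, page_height_pt):
    """Convert an hOCR bbox (pixels, top-left origin) into a PDF
    rectangle (points, bottom-left origin).

    Assumption: the page image was rendered at `dpi` and covers the
    whole, unrotated page.
    """
    x0, y0, x1, y1 = bbox
    scale = 72.0 / dpi  # pixels -> PDF points
    left = x0 * scale
    right = x1 * scale
    # The y axis is flipped: the hOCR bottom edge (y1) becomes the
    # lower PDF coordinate.
    bottom = page_height_pt - y1 * scale
    top = page_height_pt - y0 * scale
    return (left, bottom, right, top)
```

For a US Letter page (792 pt high) rendered at 300 DPI, `hocr_bbox_to_pdf((300, 150, 600, 300), 300, 792)` gives a rectangle about one inch from the left edge and half an inch below the top edge.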
I get better results when extracting each image from a page and then
building a separate hOCR file for each image (I'm working on press
articles, and the layout detection is better this way). I perform the
search on the hOCR file, then translate the coordinates using the
image's position and size on the page.
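
A minimal sketch of that per-image search, under some assumptions: the hOCR marks text lines as `span` elements with class `ocr_line` and a `title` containing `bbox x0 y0 x1 y1` (pixels, top-left origin), and each image is placed on the page axis-aligned, without rotation or skew. `LineCollector`, `find_on_page`, `img_px` and `placement_pt` are hypothetical names, and the image placement would have to come from whatever the PDF library reports.

```python
import re
from html.parser import HTMLParser

BBOX = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")

class LineCollector(HTMLParser):
    """Collect (bbox, text) for every span with class 'ocr_line'."""

    def __init__(self):
        super().__init__()
        self.lines = []    # finished (bbox, text) pairs
        self._bbox = None  # bbox of the line currently being read
        self._depth = 0    # span nesting depth inside that line
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        a = dict(attrs)
        if self._bbox is not None:
            self._depth += 1  # nested span, e.g. a word inside the line
        elif "ocr_line" in (a.get("class") or "").split():
            m = BBOX.search(a.get("title") or "")
            if m:
                self._bbox = tuple(int(g) for g in m.groups())
                self._depth = 1
                self._text = []

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self._depth -= 1
            if self._depth == 0:
                self.lines.append((self._bbox, "".join(self._text).strip()))
                self._bbox = None

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

def find_on_page(hocr, query, img_px, placement_pt):
    """Return PDF rectangles (left, bottom, right, top) of lines containing query.

    img_px       -- (width, height) of the extracted image in pixels
    placement_pt -- (x, y, width, height) of the image on the page in
                    points, (x, y) being its lower-left corner
    """
    parser = LineCollector()
    parser.feed(hocr)
    img_w, img_h = img_px
    px, py, pw, ph = placement_pt
    sx, sy = pw / img_w, ph / img_h  # pixels -> points, per axis
    hits = []
    for (x0, y0, x1, y1), text in parser.lines:
        if query.lower() in text.lower():
            # Flip y: hOCR counts down from the top of the image,
            # PDF counts up from the bottom of the page.
            hits.append((px + x0 * sx, py + (img_h - y1) * sy,
                         px + x1 * sx, py + (img_h - y0) * sy))
    return hits
```

The match here is a plain case-insensitive substring test per line; OCR output on press articles would probably need something more tolerant (hyphenation, misrecognized characters).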
I wish I could train the tesseract engine better and more easily; I
bet I'd get better results!



On Dec 5, 6:06 am, farmer <[EMAIL PROTECTED]> wrote:
> Hi,
> Are there any plans to implement the ability to produce text-
> searchable PDFs from image files?
>
> Thanks,
> Farmer
