On Friday 17 July 2009 17:26:56 tmbdev wrote:
> Full, high quality PDF import actually depends on the kind of PDF.
Indeed.

[Note: re-ordered paragraphs follow]
> And then there are many other different kinds of PDFs; for those, it
> is probably best to render them at 300dpi and then hand the rendered
> images over to OCRopus like TIFF input.
This is basically what I'm doing (although I chose 150dpi; it's a parameter,
though).
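
For illustration only, here is a minimal sketch of rendering with the
resolution as a parameter. It assumes poppler's pdftoppm tool is installed;
that is an assumption for the example, not necessarily the renderer used
here, and the function names are hypothetical.

```python
import subprocess


def pdftoppm_command(pdf_path, out_prefix, dpi=150):
    """Build a pdftoppm command line that renders each page to a PNG.

    Assumption: poppler's pdftoppm is the renderer. -r sets the resolution
    in DPI for both axes, -png selects PNG output.
    """
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return ["pdftoppm", "-r", str(dpi), "-png", pdf_path, out_prefix]


def render_pdf(pdf_path, out_prefix, dpi=150):
    """Run the renderer; raises CalledProcessError if pdftoppm fails."""
    subprocess.run(pdftoppm_command(pdf_path, out_prefix, dpi), check=True)
```

Calling render_pdf("book.pdf", "page", dpi=300) would write one image per
page using the "page" prefix, which can then be handed over like any other
scanned input.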

> Many PDFs are just collections of scanned page images.  In those
> cases, the best thing to do is to extract the page images and hand
> them to OCRopus directly.  If those images contain OCR text from
> Distiller, that, too, is potentially useful and it would be good to
> extract that so that OCRopus can combine it with its own results.
I don't do that yet. I render, rather than extracting images. It might not be 
hard to detect this though. We'd also need to decide how to match up page 
images and text.
 
> Other PDFs are purely typeset and contain perfect textual
> information.  In those cases, OCRopus layout analysis is still useful,
> but we should find a way of extracting the text itself and skipping
> the OCR step itself.
This isn't necessarily too hard either. Presumably we want to get the text, 
and the bounding box / position for each character?
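
To make the question concrete, a rough sketch of what a per-character record
could look like, and how its box would map onto a page image rendered at a
given resolution. The CharBox type and to_pixels helper are hypothetical
names. The sketch assumes an unrotated page whose lower-left corner is at
(0, 0), with coordinates in PDF points (1/72 inch, y increasing upwards).

```python
from dataclasses import dataclass

POINTS_PER_INCH = 72.0  # default PDF user-space unit is 1/72 inch


@dataclass
class CharBox:
    """One character and its bounding box in PDF points (origin bottom-left)."""
    char: str
    x0: float  # left
    y0: float  # bottom
    x1: float  # right
    y1: float  # top


def to_pixels(box, page_height_pt, dpi=150):
    """Map a CharBox to (left, top, right, bottom) pixel coordinates.

    The y axis is flipped because raster images put the origin at the
    top-left, while PDF user space puts it at the bottom-left.
    """
    scale = dpi / POINTS_PER_INCH
    left = round(box.x0 * scale)
    right = round(box.x1 * scale)
    top = round((page_height_pt - box.y1) * scale)
    bottom = round((page_height_pt - box.y0) * scale)
    return (left, top, right, bottom)
```

With boxes in pixel coordinates, the extracted text could be lined up with
the regions that layout analysis finds on the rendered image of the same
page.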

Brad


