On Friday 17 July 2009 17:26:56 tmbdev wrote:
> Full, high quality PDF import actually depends on the kind of PDF.

Indeed.
[Note: re-ordered paragraphs follow]

> And then there are many other different kinds of PDFs; for those, it
> is probably best to render them at 300dpi and then hand the rendered
> images over to OCRopus like TIFF input.

This is basically what I'm doing (although I chose 150dpi; it's a parameter, though).

> Many PDFs are just collections of scanned page images. In those
> cases, the best thing to do is to extract the page images and hand
> them to OCRopus directly. If those images contain OCR text from
> Distiller, that, too, is potentially useful and it would be good to
> extract that so that OCRopus can combine it with its own results.

I don't do that yet. I render, rather than extracting images. It might not be hard to detect this, though. We'd also need to decide how to match up page images and text.

> Other PDFs are purely typeset and contain perfect textual
> information. In those cases, OCRopus layout analysis is still useful,
> but we should find a way of extracting the text itself and skipping
> the OCR step itself.

This isn't necessarily too hard either. Presumably we want to get the text, and the bounding box / position for each character?

Brad
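
On matching up page images and text: a minimal sketch of how per-character bounding boxes taken from a PDF might be mapped onto a page image rendered at a chosen dpi (150 or 300 in the discussion above). It relies on the default PDF user-space unit of 1/72 inch with the origin at the bottom-left of the page. The names `Box`, `page_size_in_pixels` and `pdf_box_to_pixels` are hypothetical and not part of OCRopus or of any PDF library. The sketch also assumes an unrotated page whose lower-left corner is at (0, 0); a real importer would have to handle page rotation and crop boxes.

```python
from typing import NamedTuple, Tuple

# Default PDF user-space unit: 1/72 inch.
POINTS_PER_INCH = 72.0


class Box(NamedTuple):
    """Axis-aligned rectangle (hypothetical helper type)."""
    x0: float
    y0: float
    x1: float
    y1: float


def page_size_in_pixels(width_pt: float, height_pt: float, dpi: float) -> Tuple[int, int]:
    """Pixel dimensions of a page of the given size (in points) rendered at `dpi`."""
    return (round(width_pt * dpi / POINTS_PER_INCH),
            round(height_pt * dpi / POINTS_PER_INCH))


def pdf_box_to_pixels(box: Box, page_height_pt: float, dpi: float) -> Box:
    """Map a box in PDF points (origin bottom-left, y up) to image pixels
    (origin top-left, y down).

    Assumes an unrotated page with its lower-left corner at (0, 0).
    """
    def to_px(value_pt: float) -> float:
        return value_pt * dpi / POINTS_PER_INCH

    return Box(
        to_px(box.x0),
        to_px(page_height_pt - box.y1),  # PDF top edge becomes the image's upper y
        to_px(box.x1),
        to_px(page_height_pt - box.y0),  # PDF bottom edge becomes the image's lower y
    )


if __name__ == "__main__":
    # US Letter page (612 x 792 points) rendered at 150 dpi.
    print(page_size_in_pixels(612, 792, 150))
    # A character box one inch from the left edge, near the top of the page.
    print(pdf_box_to_pixels(Box(72, 720, 144, 732), 792, 150))
```

Because the rendering resolution is a plain argument, the same mapping works whether pages are rendered at 150dpi or 300dpi, so extracted text positions and OCR results can be compared in one pixel coordinate system.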
