But we're talking about PDF input, not PDF output, aren't we? Once the images are in OCRopus, OCRopus' PDF generation commands (existing and future) take care of matching up OCR'ed text and images.
Tom

On Sat, Jul 18, 2009 at 09:47, Jeffrey Ratcliffe <[email protected]> wrote:
>
> 2009/7/18 Thomas Breuel <[email protected]>:
>>>> Many PDFs are just collections of scanned page images. In those
>>>> cases, the best thing to do is to extract the page images and hand
>>>> them to OCRopus directly. If those images contain OCR text from
>>>> Distiller, that, too, is potentially useful and it would be good to
>>>> extract that so that OCRopus can combine it with its own results.
>>> I don't do that yet. I render, rather than extracting images. It might
>>> not be hard to detect this though. We'd also need to decide how to
>>> match up page images and text.
>>
>> Match up in what sense?
>
> I embed the OCR output behind the scanned image so that to some extent
> you can highlight, copy and paste the text in the correct places.
>
> Regards
>
> Jeff
