2009/7/18 Thomas Breuel <[email protected]>:
>>> Many PDFs are just collections of scanned page images. In those
>>> cases, the best thing to do is to extract the page images and hand
>>> them to OCRopus directly. If those images contain OCR text from
>>> Distiller, that, too, is potentially useful and it would be good to
>>> extract that so that OCRopus can combine it with its own results.
>> I don't do that yet. I render, rather than extracting images. It might not be
>> hard to detect this though. We'd also need to decide how to match up page
>> images and text.
>
> Match up in what sense?

I embed the OCR output behind the scanned image so that, to some extent, you
can highlight, copy and paste the text in the correct places.

Regards

Jeff
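
For illustration only, here is a minimal sketch of how OCR text could be placed behind a page image in a PDF. It is not taken from the tool discussed in this thread. It assumes the OCR step yields word boxes in image pixels with the origin at the top left, and it relies on PDF text rendering mode 3 (invisible text, operator `3 Tr`). The function names, the font resource name `F1`, and the input layout are all hypothetical.

```python
def escape_pdf_string(text):
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_ops(words, image_height_px, dpi, font_name="F1"):
    """Build PDF content-stream operators that place OCR words invisibly.

    words: list of (text, left, top, right, bottom) tuples in image pixels,
           origin at the top-left corner (an assumed input layout).
    font_name: hypothetical name of a font in the page's resource dictionary.
    """
    scale = 72.0 / dpi  # default PDF user space has 72 units per inch
    ops = ["BT", "3 Tr"]  # begin text object; rendering mode 3 = invisible
    for text, left, top, right, bottom in words:
        font_size = (bottom - top) * scale       # box height as font size
        x = left * scale
        y = (image_height_px - bottom) * scale   # PDF origin is bottom-left
        ops.append(f"/{font_name} {font_size:.2f} Tf")
        ops.append(f"1 0 0 1 {x:.2f} {y:.2f} Tm")  # move to the word's corner
        ops.append(f"({escape_pdf_string(text)}) Tj")
    ops.append("ET")  # end text object
    return "\n".join(ops)


if __name__ == "__main__":
    # One word on a 300 dpi scan that is 3300 pixels tall.
    print(invisible_text_ops([("Hello", 300, 300, 600, 360)], 3300, 300))
```

The resulting operators would go into the same page content stream that draws the scanned image, so a viewer selects the hidden words at roughly the right places. The sketch does not stretch each word to the width of its box, and text outside the font's encoding would need more care than a literal string.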
