On Monday 31 July 2006 23:58, Ian Cheong wrote:
> I think David said they might have got those settings wrong to start.
> So he wants to convert them after the fact, which is no problem.
>
> I am seriously looking at using Acrobat for scanning, as it can do
> attempted OCR and still keep bitmaps of the bits it can't OCR
> properly. So the resulting pdfs are text searchable. OCR speed
> appears to be a minor problem. (Usual problem of compression CPUs vs
> storage MB tradeoff.)
And how would Acrobat know it failed with the OCR? I really tried it a
lot. Especially with distinguishing numbers from characters there are
still big problems, and depending on the font some numbers (e.g. 1 and
7) frequently get mixed up - a catastrophe waiting to happen if you
rely on such machine-interpreted data.

And why would you use PDF as the file format? PDF is designed to allow
printing / displaying a document exactly as the original author
intended (metrics) - something that will be close to impossible after
the OCR process. Why not rather a human-readable format like HTML /
SGML / OASIS, which for practical purposes will achieve the same and
is even easier to search?

> I have not been able to find any good open source OCR products yet,
> as the technology is apparently hard work.

You can strike the "open source" - I haven't been able to find *any*
trustworthy OCR software yet. But you are right - commercial products
are still significantly better than the available FOSS alternatives.

Horst

_______________________________________________
Gpcg_talk mailing list
[email protected]
http://ozdocit.org/cgi-bin/mailman/listinfo/gpcg_talk
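[Editorial sketch, not part of the original thread.] One practical defence against the 1-vs-7 misreads discussed above is to only accept OCR'd identifiers that carry a check digit, so a single substituted digit is caught automatically. The snippet below is a minimal illustration using the Luhn checksum (used by many ID and card-number schemes); the specific numbers are made up for the example.

```python
# Hypothetical illustration: reject OCR'd numbers that fail a
# Luhn check digit. Luhn detects every single-digit substitution,
# which is exactly the 1 <-> 7 failure mode described in the email.

def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    digits = [int(c) for c in number]
    # Double every second digit from the right (excluding the check
    # digit itself); subtract 9 when the doubled value exceeds 9.
    for i in range(len(digits) - 2, -1, -2):
        digits[i] *= 2
        if digits[i] > 9:
            digits[i] -= 9
    return sum(digits) % 10 == 0

original = "79927398713"   # a number with a valid Luhn check digit
misread = "19927398713"    # OCR read the leading 7 as a 1

print(luhn_valid(original))  # True
print(luhn_valid(misread))   # False
```

A check like this does not make OCR accurate, but it turns a silent corruption into a flagged error that can be sent back for human review.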
