On Monday 31 July 2006 23:58, Ian Cheong wrote:
> I think David said they might have got those settings wrong to start.
> So he wants to convert them after the fact, which is no problem.
>
> I am seriously looking at using Acrobat for scanning, as it can do
> attempted OCR and still keep bitmaps of the bits it can't OCR
> properly. So the resulting PDFs are text searchable. OCR speed
> appears to be a minor problem. (The usual tradeoff of compression
> CPU time vs storage MB.)

And how would Acrobat know when its OCR has failed?
I have tried this extensively. There are still big problems distinguishing 
numbers from letters, and depending on the font some digits (e.g. 1 and 7) 
frequently get mixed up - a catastrophe waiting to happen if you rely on 
such machine-interpreted data.
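For what it's worth, most OCR engines report a per-word confidence score, so you could at least flag the uncertain spots for human review instead of trusting them blindly. A minimal sketch in Python (the `ocr_words` data and the 80% threshold are my own assumptions for illustration, not anything Acrobat actually exposes):

```python
# Sketch: split OCR output into trusted words and words needing human
# review. Assumes the OCR engine returns (text, confidence) pairs;
# the 80.0 threshold is an arbitrary assumption for illustration.

def split_by_confidence(ocr_words, threshold=80.0):
    """Partition OCR output into (trusted, needs_review) lists."""
    trusted, needs_review = [], []
    for text, confidence in ocr_words:
        if confidence >= threshold:
            trusted.append(text)
        else:
            needs_review.append((text, confidence))
    return trusted, needs_review

# Hypothetical OCR output - the low-confidence entry is exactly the
# kind of digit string (1 vs 7 confusion) you would not want to trust.
ocr_words = [("Patient", 96.5), ("ID", 91.0), ("1234", 62.3), ("mg", 88.1)]
trusted, needs_review = split_by_confidence(ocr_words)
print(trusted)        # words accepted as-is
print(needs_review)   # words to show a human before relying on them
```

This does not make the OCR more accurate, of course - it only makes the failures visible instead of silent.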

And why would you use PDF as the file format? PDF is designed to reproduce a 
document exactly as the original author intended (layout and metrics) - 
something that will be close to impossible to preserve after the OCR process.

Why not use a human-readable format like HTML / SGML / OASIS instead? For 
practical purposes it would achieve the same result and would be even easier 
to search.

> I have not been able to find any good open source OCR products yet,
> as the technology is apparently hard work.

You can strike that "open source" - I haven't been able to find *any* 
trustworthy OCR software yet. But you are right - commercial products are 
still significantly better at OCR than the available FOSS alternatives.

Horst
_______________________________________________
Gpcg_talk mailing list
[email protected]
http://ozdocit.org/cgi-bin/mailman/listinfo/gpcg_talk
