[ilugd] Searchable PDF/A files

Anand Shankar Tue, 07 Feb 2012 23:33:35 -0800

PDF/A was developed as an open standard (ISO 19005) for the long-term
storage and archiving of documents. Such archived documents are far
more valuable if their content is searchable. I understand that a
number of solutions are available for generating PDF/A documents.
Notable ones in the FOSS world: LibreOffice can export documents in
PDF/A-1a format, and the iText library used by JasperReports can also
produce PDF/A-compliant PDF documents.



Recently, HP multifunction printers such as the HP MFP 4555 have
started offering an option to scan to PDF/A. However, these scanned
files are image-only PDFs and are not searchable. I am looking for a
method to convert such image-based, PDF/A-compliant files into
searchable PDF files. Once stored in a document management system,
they would then be far more valuable, because searches could run on
the content of the scans.

Presumably the scanned PDF/A files generated by the HP MFP 4555 do not
contain an OCR'd text layer.

Tesseract is one of the more powerful OCR engines in the FOSS world.
Several add-ons for Tesseract are also listed at
http://code.google.com/p/tesseract-ocr/wiki/AddOns . But I have been
unable to find out whether there is a way to generate text-searchable
PDF files with Tesseract using scanned PDF/A files as input.
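
For illustration, the kind of pipeline being asked about might look
like the sketch below. It rests on several assumptions that are not
part of this post: a Tesseract build that accepts the `pdf` output
config (available in later releases, not confirmed for the version
discussed here), and the poppler tools `pdftoppm` and `pdfunite` being
installed. The function names are hypothetical, and the sketch makes no
claim that the result is still PDF/A compliant.

```python
import shutil
import subprocess
from pathlib import Path


def rasterize_cmd(pdf_path, prefix, dpi=300):
    """Command to render every page of a scanned PDF to PNG (poppler's pdftoppm)."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(prefix)]


def ocr_cmd(image_path, out_base, lang="eng"):
    """Command asking Tesseract for a PDF with a hidden text layer.

    Tesseract appends ".pdf" to out_base itself (assumes the `pdf` config exists).
    """
    return ["tesseract", str(image_path), str(out_base), "-l", lang, "pdf"]


def merge_cmd(page_pdfs, out_pdf):
    """Command to join the per-page PDFs into one file (poppler's pdfunite)."""
    return ["pdfunite", *[str(p) for p in page_pdfs], str(out_pdf)]


def make_searchable(pdf_path, out_pdf, workdir, lang="eng"):
    """Rasterize, OCR each page, and merge; needs the external tools installed."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_cmd(pdf_path, workdir / "page"), check=True)

    page_pdfs = []
    # pdftoppm zero-pads page numbers, so a lexical sort keeps page order.
    for image in sorted(workdir.glob("page-*.png")):
        out_base = image.with_suffix("")  # strip ".png"
        subprocess.run(ocr_cmd(image, out_base, lang), check=True)
        page_pdfs.append(out_base.with_suffix(".pdf"))

    if not page_pdfs:
        raise RuntimeError("no pages were rendered from the input PDF")
    if len(page_pdfs) == 1:
        # Nothing to merge for a single-page scan.
        shutil.copyfile(page_pdfs[0], out_pdf)
    else:
        subprocess.run(merge_cmd(page_pdfs, out_pdf), check=True)
    return Path(out_pdf)
```

A call such as make_searchable("scan.pdf", "scan-ocr.pdf", "work") would
then leave a searchable copy next to the original. PDF/A conformance of
the output would have to be checked separately.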

Can anyone share their experience of generating searchable text in
scanned PDF files?

anand

Anand Shankar

_______________________________________________
Ilugd mailing list
[email protected]
http://frodo.hserus.net/mailman/listinfo/ilugd
