PDF/A was developed as an open standard (ISO 19005) for the long-term storage and archiving of documents. Such archived documents are far more valuable if their content is searchable. I understand that a number of solutions are available for generating PDF/A documents. Notable ones in the FOSS world: LibreOffice can export documents in PDF/A-1a format, and the iText library used by JasperReports can also produce PDF/A-compliant documents.
Recently, HP multifunction printers such as the HP MFP 4555 have started offering an option to scan to PDF/A. However, these scanned files are image-only PDFs and are not searchable, perhaps because the files the HP MFP 4555 generates contain no OCR'd text. I am looking for a way to convert such image-based, PDF/A-compliant files into searchable PDFs, so that once they are stored in a document management system their content can be searched, which would make them far more valuable.

Tesseract is one of the more powerful OCR engines in the FOSS world, and several add-ons for it are listed at http://code.google.com/p/tesseract-ocr/wiki/AddOns . However, I have not been able to find out whether Tesseract can generate text-searchable PDF files from scanned PDF/A input.

Can anyone share their experience of generating searchable text in scanned PDF files?

Anand Shankar
_______________________________________________
Ilugd mailing list
http://frodo.hserus.net/mailman/listinfo/ilugd
