[ https://issues.apache.org/jira/browse/PDFBOX-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
John Hewson updated PDFBOX-1912:
--------------------------------
Environment: JDK 6, C/C++ (was: JDK 6, C++)
> Optical Character Recognition (OCR)
> -----------------------------------
>
> Key: PDFBOX-1912
> URL: https://issues.apache.org/jira/browse/PDFBOX-1912
> Project: PDFBox
> Issue Type: Wish
> Components: Text extraction
> Affects Versions: 2.0.0
> Environment: JDK 6, C/C++
> Reporter: John Hewson
> Assignee: John Hewson
> Labels: gsoc2014
>
> Brief explanation: The PDFBox library is widely used to extract text from PDF
> files. However, many PDF files embed text in a malformed manner that renders
> text extraction useless. There has recently been interest in extracting
> government data from PDF files, the PDF Liberation commons being a notable
> example; see https://github.com/pdfliberation for more details.
> Many end users of PDFBox have been running OCR tools such as Google's
> Tesseract (https://code.google.com/p/tesseract-ocr/) on the final page
> image generated by PDFBox. We think that a more integrated OCR API in PDFBox
> could do a better job. PDFBox often has access to encoding and positioning
> information for individual glyphs, so even when the extracted text is
> meaningless, character-by-character or line-by-line OCR could be more
> accurate than OCR of the whole page. PDFBox also has information such as
> image orientation, which could allow it to better perform OCR on content
> such as embedded landscape tables.
> There are existing JNI bindings for Tesseract available at
> https://code.google.com/p/tesseract-android-tools/
> Expected results: To extend PDFBox with an API which allows external OCR
> tools to be plugged in, and an implementation of a Tesseract plug-in using
> either JNI or the command line via Runtime.exec or ProcessBuilder.
> Knowledge Prerequisite: Java (JNI a bonus)
> Mentor: John Hewson
> PMC Note: Tesseract is under the Apache License 2.0
> To learn more about PDFBox, please visit http://pdfbox.apache.org/
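
The "Expected results" item above asks for a pluggable OCR API plus a
Tesseract plug-in. As a rough illustration only, the command-line variant
might look like the sketch below. Nothing here is existing PDFBox API: the
names OcrEngine, TesseractCommandLineEngine and buildCommand are hypothetical,
and the code assumes Java 7 or later (java.nio.file, ProcessBuilder output
redirection) rather than the JDK 6 named in the issue. It also assumes a
`tesseract` executable that accepts `tesseract <image> <outputbase>` and
writes its result to `<outputbase>.txt`.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import javax.imageio.ImageIO;

/** Hypothetical plug-in point: any OCR tool that turns a rendered image into text. */
interface OcrEngine {
    String recognize(BufferedImage image) throws IOException;
}

/** Hypothetical engine that shells out to a Tesseract executable (assumed to be installed). */
class TesseractCommandLineEngine implements OcrEngine {
    private final String executable;

    TesseractCommandLineEngine(String executable) {
        this.executable = executable;
    }

    /** Builds "<executable> <image> <outputbase>"; Tesseract appends ".txt" to the output base. */
    static List<String> buildCommand(String executable, Path image, Path outputBase) {
        return Arrays.asList(executable, image.toString(), outputBase.toString());
    }

    @Override
    public String recognize(BufferedImage image) throws IOException {
        Path dir = Files.createTempDirectory("ocr");
        Path input = dir.resolve("page.png");
        Path outputBase = dir.resolve("page");
        Path output = dir.resolve("page.txt");
        Path log = dir.resolve("tesseract.log");
        try {
            // Hand the rendered page (or a cropped region of it) to the external tool as a PNG.
            ImageIO.write(image, "png", input.toFile());
            Process process = new ProcessBuilder(buildCommand(executable, input, outputBase))
                    .redirectErrorStream(true)
                    .redirectOutput(log.toFile()) // send console output to a file so the child cannot block on a full pipe
                    .start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("OCR process exited with code " + exitCode);
            }
            return new String(Files.readAllBytes(output), StandardCharsets.UTF_8);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException("Interrupted while waiting for the OCR process", e);
        } finally {
            // Best-effort cleanup: files first, then the temporary directory itself.
            for (Path p : Arrays.asList(input, output, log, dir)) {
                try {
                    Files.deleteIfExists(p);
                } catch (IOException ignored) {
                    // leave leftovers in the temp directory rather than mask the real result
                }
            }
        }
    }
}
```

Keeping the interface down to a single method means a JNI-based engine could
be swapped in later without touching callers, and the same method could be
called once per line or per glyph region instead of once per page.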
--
This message was sent by Atlassian JIRA (v6.1.5#6160)