[jira] [Updated] (PDFBOX-1912) Optical Character Recognition (OCR)

John Hewson (JIRA) Tue, 25 Feb 2014 09:59:18 -0800

     [ https://issues.apache.org/jira/browse/PDFBOX-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


John Hewson updated PDFBOX-1912:
--------------------------------

    Environment: JDK 6, C/C++  (was: JDK 6, C++)

> Optical Character Recognition (OCR)
> -----------------------------------
>
>                 Key: PDFBOX-1912
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1912
>             Project: PDFBox
>          Issue Type: Wish
>          Components: Text extraction
>    Affects Versions: 2.0.0
>         Environment: JDK 6, C/C++
>            Reporter: John Hewson
>            Assignee: John Hewson
>              Labels: gsoc2014
>
> Brief explanation: The PDFBox library is widely used to extract text from PDF
> files. However, many PDF files embed text in a malformed manner, which renders
> text extraction useless. There has recently been interest in extracting
> governmental data from PDF files; the PDF Liberation commons is a notable
> example (see https://github.com/pdfliberation for more details).
> Many end-users of PDFBox have been making use of OCR tools such as Google's
> Tesseract (https://code.google.com/p/tesseract-ocr/), which are run on the
> final image generated by PDFBox. We think that by adding a more integrated
> OCR API to PDFBox it will be possible to do a better job. PDFBox often has
> access to encoding and positioning information for individual glyphs. Even
> when the text extracted for those glyphs is meaningless, their positions
> could guide a character-by-character or line-by-line OCR that is more
> accurate than OCR of the whole page. PDFBox also has information such as
> image orientation, which could allow it to better perform OCR on pages
> containing, for example, embedded landscape tables.
> There are existing JNI bindings for Tesseract available at 
> https://code.google.com/p/tesseract-android-tools/
> Expected results: An extension to PDFBox with an API that allows external OCR
> tools to be plugged in, and an implementation of a Tesseract plug-in using
> either JNI or the command line (via Runtime.exec or ProcessBuilder).
> Knowledge Prerequisite: Java (JNI a bonus)
> Mentor: John Hewson
> PMC Note: Tesseract is under the Apache License 2.0
> To learn more about PDFBox, please visit http://pdfbox.apache.org/
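
The expected results quoted above describe a pluggable OCR API plus a Tesseract
plug-in driven from the command line. The following is only a rough sketch of
what that could look like, not an existing PDFBox API: the names OcrSketch,
OcrEngine, CommandLineOcrEngine and buildCommand are all hypothetical. It
assumes a "tesseract" executable on the PATH that accepts the classic
"tesseract <image> <outputbase>" invocation and writes <outputbase>.txt, and it
uses java.lang.ProcessBuilder as one way of running the command line. The code
avoids Java 7 features to stay within the JDK 6 environment stated in the issue.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.Arrays;
import java.util.List;
import javax.imageio.ImageIO;

public class OcrSketch {

    /** Hypothetical plug-in point: any OCR tool that turns a rendered image into text. */
    public interface OcrEngine {
        String recognize(BufferedImage image) throws IOException;
    }

    /** Hypothetical engine that shells out to an OCR executable such as tesseract. */
    public static class CommandLineOcrEngine implements OcrEngine {
        private final String executable;

        public CommandLineOcrEngine(String executable) {
            this.executable = executable;
        }

        /** Builds "<executable> <image> <outputbase>"; the tool is assumed to write <outputbase>.txt. */
        public List<String> buildCommand(File imageFile, File outputBase) {
            return Arrays.asList(executable, imageFile.getPath(), outputBase.getPath());
        }

        public String recognize(BufferedImage image) throws IOException {
            File imageFile = File.createTempFile("ocr-", ".png");
            String imagePath = imageFile.getPath();
            // Strip the ".png" suffix to obtain the output base name.
            File outputBase = new File(imagePath.substring(0, imagePath.length() - 4));
            File textFile = new File(outputBase.getPath() + ".txt");
            try {
                ImageIO.write(image, "png", imageFile);
                ProcessBuilder builder = new ProcessBuilder(buildCommand(imageFile, outputBase));
                // Merge stderr into stdout so a single drain loop prevents the child from blocking.
                builder.redirectErrorStream(true);
                Process process = builder.start();
                drain(process.getInputStream());
                int exitCode;
                try {
                    exitCode = process.waitFor();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IOException("Interrupted while waiting for " + executable);
                }
                if (exitCode != 0) {
                    throw new IOException(executable + " exited with code " + exitCode);
                }
                return readUtf8(textFile);
            } finally {
                imageFile.delete();
                textFile.delete();
            }
        }

        /** Reads and discards everything the child process prints. */
        private static void drain(InputStream in) throws IOException {
            byte[] buffer = new byte[4096];
            while (in.read(buffer) != -1) {
                // discard console output
            }
        }

        /** Reads the whole result file as UTF-8 text. */
        private static String readUtf8(File file) throws IOException {
            Reader reader = new InputStreamReader(new FileInputStream(file), "UTF-8");
            try {
                StringBuilder text = new StringBuilder();
                char[] buffer = new char[4096];
                int n;
                while ((n = reader.read(buffer)) != -1) {
                    text.append(buffer, 0, n);
                }
                return text.toString();
            } finally {
                reader.close();
            }
        }
    }
}
```

Under these assumptions, a caller would render a page (or a cropped glyph or
line region) to a BufferedImage and pass it to
new OcrSketch.CommandLineOcrEngine("tesseract").recognize(image). A JNI-based
engine could implement the same OcrEngine interface without changing callers.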



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
