Re: [GSoC 2014]Optical Character Recognition project - Introduction

Tilman Hausherr Mon, 24 Feb 2014 22:43:27 -0800

The best approach is to build from the complete source and use that build. Then you should be able to set breakpoints within the PDFBox source code. (That's what I do all the time.)


Tilman


On 25.02.2014 07:38, DImuthu Upeksha wrote:

Hi John,
Thanks for the reply. Yes, I checked out the PDFBox code and managed to build
it successfully. I looked at the classes you mentioned and got a rough
idea of how they work. To check them, I used the jars from the target
folder in a separate Java project. I tried the samples at
http://pdfbox.apache.org/cookbook/. I need to look further into the code,
especially how the processXXX() methods work in the PDFTextStripper class.
What I usually do is add some breakpoints and inspect them in the debug
windows, but that is not possible when using jars. What approach do you
follow for such a task?

I also installed Tesseract on my machine and managed to do some OCR
with it. It's a cool tool that works well.
I'm still learning the code. If I run into any issues I'll drop you a mail.

Thanks
Dimuthu


On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <[email protected]> wrote:

Hi Dimuthu

The PDFBox website can be found at http://pdfbox.apache.org/. It contains
a basic overview of the project and details on how to obtain the source
code and build PDFBox for yourself.

Currently we do not perform any OCR, and PDFBOX-1912 details the only
thoughts so far regarding it. Note that the OCR libraries mentioned in the
JIRA issue are all under the Apache license, which is a requirement.

Once you have the source code, take a look at the PageDrawer class to see
how text and images are rendered. We want someone to interface at a low
level (e.g. one glyph, word, or sentence at a time) with an OCR engine.
Also look at PDFTextStripper, which is how text is currently extracted, and
note the great lengths we have to go to in order to sort text back into
reading order and infer the placement of diacritics. PDF is fundamentally a
visual format, not a structured format like HTML, which is why extracting
text can be so difficult sometimes.
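
To illustrate why reading order has to be reconstructed at all, here is a minimal, self-contained sketch. It is not PDFTextStripper's actual algorithm. The Fragment class, the toReadingOrder method, and the lineTolerance parameter are all hypothetical names invented for this example, and it assumes y grows downward and that text runs left to right in a single column.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ReadingOrderSketch {

    /** Hypothetical text fragment: a string plus the position where it was drawn. */
    static final class Fragment {
        final String text;
        final double x;
        final double y; // assumption: y grows downward, as in image coordinates

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups fragments into lines by vertical position, then orders each line
     * left to right. Fragments whose y differs from the first fragment of the
     * current line by at most lineTolerance are treated as the same line.
     */
    static String toReadingOrder(List<Fragment> fragments, double lineTolerance) {
        // Sort a copy top to bottom so the caller's list is left untouched.
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        // Walk down the page, starting a new line whenever the gap is too large.
        List<List<Fragment>> lines = new ArrayList<>();
        for (Fragment f : sorted) {
            List<Fragment> current = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (current != null && Math.abs(f.y - current.get(0).y) <= lineTolerance) {
                current.add(f);
            } else {
                List<Fragment> line = new ArrayList<>();
                line.add(f);
                lines.add(line);
            }
        }

        // Within each line, order fragments left to right and join with spaces.
        StringBuilder out = new StringBuilder();
        for (List<Fragment> line : lines) {
            line.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            if (out.length() > 0) {
                out.append('\n');
            }
            for (int i = 0; i < line.size(); i++) {
                if (i > 0) {
                    out.append(' ');
                }
                out.append(line.get(i).text);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Fragments listed in drawing order, which need not match reading order.
        List<Fragment> drawn = new ArrayList<>();
        drawn.add(new Fragment("world", 60, 10.4));
        drawn.add(new Fragment("line", 50, 30));
        drawn.add(new Fragment("Hello", 10, 10));
        drawn.add(new Fragment("Second", 10, 30.2));
        System.out.println(toReadingOrder(drawn, 2.0));
    }
}
```

A real extractor also has to cope with multiple columns, rotated text, right-to-left scripts, and diacritics drawn as separate glyphs, none of which this sketch attempts.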

The full PDF Reference document can be found at:

http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf

Feel free to discuss specifics of your proposal or ask any questions.

Thanks,

-- John

On 23 Feb 2014, at 21:13, DImuthu Upeksha <[email protected]> wrote:

Hi,
I am Dimuthu Upeksha, a Computer Engineering undergraduate at the University
of Moratuwa, Sri Lanka. I successfully completed GSoC 2013 with the Apache
ISIS [1] project. I'm very interested in OCR and image processing, so I
would like to select this project idea as my GSoC 2014 project because I
feel it is the best-suited project for me. At university we have also done
some research in the OCR area, and our group wrote a literature review about
increasing the efficiency of OCR systems (attached). Can you please suggest
where to start learning about PDFBox?

[1] http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29

Thank you
Dimuthu

--
Regards
W.Dimuthu Upeksha
Undergraduate
Department of Computer Science And Engineering
University of Moratuwa, Sri Lanka
