Re: [GSoC 2014]Optical Character Recognition project - Introduction

DImuthu Upeksha Sat, 01 Mar 2014 10:10:23 -0800

I have made the necessary changes to the document [1].

For the last two days I have taken a close look at this [2] JNI wrapper for
the Tesseract API.
Unfortunately it was designed for the Android environment, so I think we
need to write our own makefiles to build it into a DLL (Windows) or a
dylib (Mac). Currently it has Android.mk files [3]. I'm looking for a
way to convert them into a makefile that we can run from the console. Please
suggest a better approach if you have one.
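
On the Java side, whichever build we end up with has to produce a file whose
name matches what the JVM looks for on each platform. A minimal sketch of that
naming convention follows. The library name "tessjni" and the class name are
made up for illustration; they are not taken from tesseract-android-tools.

```java
import java.util.Locale;

class NativeLibraryName {

    // Returns the file name a JNI library conventionally has on the given OS.
    // The mac check comes first because "darwin" also contains "win".
    static String fileNameFor(String osName, String baseName) {
        String os = osName.toLowerCase(Locale.ROOT);
        if (os.contains("mac") || os.contains("darwin")) {
            return "lib" + baseName + ".dylib";
        }
        if (os.contains("win")) {
            return baseName + ".dll";
        }
        return "lib" + baseName + ".so";
    }

    public static void main(String[] args) {
        String os = System.getProperty("os.name");
        // "tessjni" is a hypothetical library name used only for this sketch.
        System.out.println(os + " -> " + fileNameFor(os, "tessjni"));
        // The JVM's own mapping, which System.loadLibrary relies on.
        System.out.println("JVM mapping: " + System.mapLibraryName("tessjni"));
    }
}
```

The makefile targets would then need to emit exactly these file names so that
System.loadLibrary can find the library on java.library.path.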


[1]
https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf
[2]
https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/
[3]
https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk


On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <[email protected]> wrote:

> This is a good start. However, there is no need for the Adder component;
> "Extracted Text (OCR)" can just feed back into the PDFBox "Text Extractor".
>
> Maybe show a "PDF" file feeding into "Text Extractor", to make it clear
> where the process starts.
>
> -- John
>
> On 26 Feb 2014, at 16:53, DImuthu Upeksha <[email protected]>
> wrote:
>
> > Sorry for the mistake. I added it to my Dropbox [1].
> >
> > [1]
> > https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
> >
> > Thanks
> > Dimuthu
> >
> >
> > On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <[email protected]> wrote:
> >
> >> I should add that the OCR engine should be pluggable so PDFToText might
> >> use an interface, e.g. OCREngine and there will be a TesseractOCREngine
> >> class somewhere which provides the required functionality and lives in a
> >> separate jar file.
> >>
> >> -- John
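
A rough sketch of how I read this suggestion is below. Only the names
OCREngine and TesseractOCREngine come from John's mail; the method signature,
the stub engine and the extractor class are my assumptions, and the real
Tesseract-backed class would live in its own jar.

```java
import java.awt.image.BufferedImage;

// The pluggable contract: an image region in, recognised text out.
// The method name and signature are assumptions, not an agreed API.
interface OCREngine {
    String recognize(BufferedImage image);
}

// Stand-in engine for this sketch. A TesseractOCREngine would implement
// the same interface and call the native library instead.
class FixedTextOCREngine implements OCREngine {
    private final String text;

    FixedTextOCREngine(String text) {
        this.text = text;
    }

    @Override
    public String recognize(BufferedImage image) {
        return text;
    }
}

// Hypothetical caller: depends only on the interface, never on Tesseract.
class OcrTextExtractor {
    private final OCREngine engine;

    OcrTextExtractor(OCREngine engine) {
        this.engine = engine;
    }

    String extract(BufferedImage region) {
        return engine.recognize(region).trim();
    }
}

class PluggableOcrDemo {
    public static void main(String[] args) {
        BufferedImage region = new BufferedImage(8, 8, BufferedImage.TYPE_INT_RGB);
        OcrTextExtractor extractor =
                new OcrTextExtractor(new FixedTextOCREngine(" sample text "));
        System.out.println(extractor.extract(region));
    }
}
```

With this shape the core code compiles without any OCR dependency, and a
different engine can be swapped in by passing another implementation.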
> >>
> >>> On 25 Feb 2014, at 20:18, Dimuthu <[email protected]> wrote:
> >>>
> >>> So do you need to embed the new functionality into the existing
> >>> PDFToText algorithms, or package it as a new subsystem (something like
> >>> an API)?
> >>>
> >>> -----Original Message-----
> >>> From: "John Hewson" <[email protected]>
> >>> Sent: 26/02/2014 07:38
> >>> To: "[email protected]" <[email protected]>
> >>> Subject: Re: [GSoC 2014]Optical Character Recognition project -
> >> Introduction
> >>>
> >>> Yes, exactly. By location data I just mean (x,y) coordinates and page
> >>> rotation.
> >>>
> >>> There is another use case for OCR: some fonts embedded in PDFs have
> >>> corrupt encodings, which means the ASCII codes map to the wrong glyphs.
> >>> We could OCR the glyphs to repair the encoding.
> >>>
> >>> -- John
> >>>
> >>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <[email protected]
> >
> >> wrote:
> >>>>
> >>>> Hi John,
> >>>> Thanks for the explanation.
> >>>> Let's say there is a PDF with both text in an extractable format and
> >>>> some images containing text (scanned images). In that case we first
> >>>> extract the extractable content using the PDFBox algorithms, and the
> >>>> rest is extracted using OCR. Finally we combine both results and return
> >>>> them as the PDFToText output. Am I correct? What do you mean by
> >>>> "location data"?
> >>>>
> >>>> Thanks
> >>>> Dimuthu
> >>>>
> >>>>
> >>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <[email protected]>
> >> wrote:
> >>>>>
> >>>>>> 1. What are "glyphs"?
> >>>>>
> >>>>> http://en.wikipedia.org/wiki/Glyph
> >>>>>
> >>>>>> 2. What is the main requirement of this project?
> >>>>>> As far as I understand, first we need to generate an image of
> >>>>>> malformed PDFs with PDFBox and then process it with OCR to get more
> >>>>>> accurate results. But why shouldn't we run OCR directly on those
> >>>>>> PDFs without getting output from PDFBox? Correct me if I'm wrong.
> >>>>>
> >>>>> PDFBox can generate images (PDFToImage) and can extract text
> >>>>> (PDFToText). The goal of this project is to enhance PDFToText so that
> >>>>> it can use OCR to extract text from areas of the document where the
> >>>>> text is embedded as an image. Such PDF files are typically generated
> >>>>> by scanners or fax machines. There is also another case where OCR is
> >>>>> useful: some fonts embedded in PDF files contain the wrong encoding,
> >>>>> so when text is extracted with PDFToText the result is nonsense, but
> >>>>> when drawn with PDFToImage we see the correct letters.
> >>>>>
> >>>>> Instead of:
> >>>>> PDF => Image => OCR => Text
> >>>>>
> >>>>> We want to do:
> >>>>> PDF => (Many images for words + location data => OCR) => Text
> >>>>>
> >>>>> -- John
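
To check my understanding of the "location data" step, here is a small sketch
of combining normally extracted text with OCR results by position. All class
and field names are hypothetical. It assumes a top-left origin with y growing
downwards (PDF's native coordinates start at the bottom-left, so a real
version would have to convert), it ignores page rotation, and it only groups
fragments into a line when their y values are identical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// One piece of text plus where it sits on the page (hypothetical type).
class TextFragment {
    final String text;
    final double x;
    final double y;

    TextFragment(String text, double x, double y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

class ReadingOrderMerge {

    // Merges both sources, then orders them top to bottom, left to right.
    static String merge(List<TextFragment> extracted, List<TextFragment> fromOcr) {
        List<TextFragment> all = new ArrayList<>(extracted);
        all.addAll(fromOcr);
        all.sort(Comparator.comparingDouble((TextFragment f) -> f.y)
                .thenComparingDouble(f -> f.x));

        StringBuilder out = new StringBuilder();
        for (TextFragment f : all) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(f.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<TextFragment> extracted = new ArrayList<>();
        extracted.add(new TextFragment("Hello", 10, 10));
        extracted.add(new TextFragment("world", 200, 10));

        List<TextFragment> fromOcr = new ArrayList<>();
        fromOcr.add(new TextFragment("scanned", 100, 10));

        System.out.println(merge(extracted, fromOcr));
    }
}
```

PDFTextStripper already does far more careful sorting than this, so the real
work would be feeding OCR results into that existing logic rather than
replacing it.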
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <
> >>>>> [email protected]
> >>>>>>> wrote:
> >>>>>>
> >>>>>>> OK, fixed. This is what I did:
> >>>>>>> Right-click on the new project -> Debug As -> Debug Configurations
> >>>>>>> -> Source -> Add -> Project
> >>>>>>> Then I selected the PDFBox project.
> >>>>>>>
> >>>>>>> Thanks
> >>>>>>> Dimuthu
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <
> >>>>>>> [email protected]> wrote:
> >>>>>>>
> >>>>>>>> I'm using Eclipse. This is what I want. I created a new Java
> >>>>>>>> application project (say TestPDFBox) with a main class containing
> >>>>>>>> the following code.
> >>>>>>>>
> >>>>>>>> PDDocument document = new PDDocument();
> >>>>>>>> PDPage blankPage = new PDPage();
> >>>>>>>> document.addPage(blankPage);
> >>>>>>>> document.save("BlankPage.pdf");
> >>>>>>>> document.close();
> >>>>>>>>
> >>>>>>>> Then I need to add the jar files generated in the target folder of
> >>>>>>>> PDFBox to the build path of my new project (I built the PDFBox
> >>>>>>>> project from source). That is what I did. But let's say I need to
> >>>>>>>> check the functionality of the document.save("") method. I don't
> >>>>>>>> have a reference to its sources because I used the generated jars
> >>>>>>>> directly. As Tilman said, I built PDFBox from sources, but I don't
> >>>>>>>> know a proper way to use it in other projects other than adding
> >>>>>>>> those jar files to the build path.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <[email protected]>
> >>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Which IDE are you using? You should be able to run the PDFToText
> >>>>>>>>> class (in pdfbox-tools) using your IDE and pass a PDF file path as
> >>>>>>>>> the command line argument.
> >>>>>>>>>
> >>>>>>>>> -- John
> >>>>>>>>>
> >>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <
> >>>>> [email protected]>
> >>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi John,
> >>>>>>>>>> Thanks for the reply. Yes, I checked out the PDFBox code and
> >>>>>>>>>> managed to build it successfully. I looked at the classes you
> >>>>>>>>>> mentioned and got a rough idea of how they work. To check them I
> >>>>>>>>>> added the jars in the target folder to a separate Java project. I
> >>>>>>>>>> tried the samples at http://pdfbox.apache.org/cookbook/. I need
> >>>>>>>>>> to look further into the code, especially how the processXXX()
> >>>>>>>>>> methods work in the PDFTextStripper class. What I usually do is
> >>>>>>>>>> add some breakpoints and inspect them in the debug windows, but
> >>>>>>>>>> that is not possible with jars. How do you go about such a task?
> >>>>>>>>>>
> >>>>>>>>>> I also installed Tesseract on my machine and managed to do some
> >>>>>>>>>> OCR with it. It's a cool tool that works well.
> >>>>>>>>>> I'm still learning the code. If I run into any issues I'll drop
> >>>>>>>>>> you a mail.
> >>>>>>>>>>
> >>>>>>>>>> Thanks
> >>>>>>>>>> Dimuthu
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <
> [email protected]
> >>>
> >>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi Dimuthu
> >>>>>>>>>>>
> >>>>>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/
> >>>>>>>>>>> and contains a basic overview of the project and details on how
> >>>>>>>>>>> to obtain the source code and build PDFBox for yourself.
> >>>>>>>>>>>
> >>>>>>>>>>> Currently we do not perform any OCR, and PDFBOX-1912 details the
> >>>>>>>>>>> only thoughts so far regarding it. Note that the OCR libraries
> >>>>>>>>>>> mentioned in the JIRA issue are all under the Apache license,
> >>>>>>>>>>> which is a requirement.
> >>>>>>>>>>>
> >>>>>>>>>>> Once you have the source code, take a look at the PageDrawer
> >>>>>>>>>>> class to see how text and images are rendered. We want someone
> >>>>>>>>>>> to interface at a low level (e.g. one glyph, word, or sentence
> >>>>>>>>>>> at a time) with an OCR engine. Also look at PDFTextStripper,
> >>>>>>>>>>> which is how text is currently extracted, and at how we have to
> >>>>>>>>>>> go to great lengths to sort text back into reading order and
> >>>>>>>>>>> infer the placement of diacritics. PDF is fundamentally a visual
> >>>>>>>>>>> format, not a structured format like HTML, which is why
> >>>>>>>>>>> extracting text can be so difficult sometimes.
> >>>>>>>>>>>
> >>>>>>>>>>> The full PDF Reference document can be found at:
> >>>>>>>>>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
> >>>>>>>>>>>
> >>>>>>>>>>> Feel free to discuss specifics of your proposal or ask any
> >>>>>>>>>>> questions.
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>>
> >>>>>>>>>>> -- John
> >>>>>>>>>>>
> >>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <
> >>>>> [email protected]
> >>>>>>>>>>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi,
> >>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering undergraduate at
> >>>>>>>>>>>> the University of Moratuwa, Sri Lanka. I successfully completed
> >>>>>>>>>>>> GSoC 2013 with the Apache ISIS [1] project. I'm very much
> >>>>>>>>>>>> interested in OCR and image processing, so I would like to
> >>>>>>>>>>>> select this project idea as my GSoC 2014 project because I feel
> >>>>>>>>>>>> it is the best-suited project for me. At university we have
> >>>>>>>>>>>> also done some research in the OCR area, and our group wrote a
> >>>>>>>>>>>> literature review about increasing the efficiency of OCR
> >>>>>>>>>>>> systems (attached). Could you please suggest where I should
> >>>>>>>>>>>> start learning about PDFBox?
> >>>>>>>>>>>>
> >>>>>>>>>>>> [1]
> >>>>>>>>>>>> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thank you
> >>>>>>>>>>>> Dimuthu
> >>>>>>>>>>>>
> >>>>>>>>>>>> --
> >>>>>>>>>>>> Regards
> >>>>>>>>>>>> W.Dimuthu Upeksha
> >>>>>>>>>>>> Undergraduate
> >>>>>>>>>>>> Department of Computer Science And Engineering
> >>>>>>>>>>>> University of Moratuwa, Sri Lanka


-- 
Regards

W.Dimuthu Upeksha
Undergraduate
Department of Computer Science And Engineering

University of Moratuwa, Sri Lanka
