Re: [GSoC 2014]Optical Character Recognition project - Introduction

John Hewson Mon, 03 Mar 2014 12:39:36 -0800

Dimuthu

Your new diagram looks good. The JNI wrapper for Tesseract is indeed for
Android, so it will need porting to a standard desktop C++ environment. We
use Maven to build PDFBox, and there is a native-maven-plugin which can build
JNI projects; see
http://docs.codehaus.org/display/MAVENUSER/Projects+With+JNI
The plugin itself is here:
http://mojo.codehaus.org/maven-native/native-maven-plugin/
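
As a rough illustration of why the native side differs per platform, the sketch below asks the JVM which file name it would look for when loading a JNI library, then attempts the load. The library base name "pdfbox_tesseract" is purely hypothetical (no name is given in this thread); only System.mapLibraryName and System.loadLibrary are standard Java calls.

```java
public class NativeLibraryNameDemo {
    /** Hypothetical library base name; the thread does not name one. */
    static final String LIB_NAME = "pdfbox_tesseract";

    /** Tries to load the library from java.library.path; returns false if it is not found. */
    static boolean tryLoad(String baseName) {
        try {
            System.loadLibrary(baseName);
            return true;
        } catch (UnsatisfiedLinkError e) {
            // The native build for this platform is missing or not on the library path.
            return false;
        }
    }

    public static void main(String[] args) {
        // The JVM maps the base name to a platform-specific file name,
        // e.g. a .dll on Windows or a lib*.so on Linux.
        System.out.println("Expected file name: " + System.mapLibraryName(LIB_NAME));
        System.out.println("Loaded: " + tryLoad(LIB_NAME));
    }
}
```

Whatever build approach is chosen has to produce a file with the name the JVM expects on each target platform.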


If you’ve not used Maven before, it’s a Java build system with its own package
repository (like RubyGems or npm), so you just write an XML file and it
downloads the appropriate plugins at build time as they are required.

What operating system do you develop on? I’m on OS X, but I have VMs for most 
platforms.

Thanks

-- John

On 1 Mar 2014, at 10:09, DImuthu Upeksha <[email protected]> wrote:

> I made the necessary changes to the document [1].
> 
> For the last two days I have had a close look at this [2] JNI wrapper for
> the Tesseract API. Unfortunately it was designed for the Android
> environment, so I think we need to write our own makefiles to build it into
> a DLL (Windows) or a dylib (Mac). Currently it has Android.mk files [3]. I'm
> searching for a way to convert them into a makefile that we can run from the
> console. Please suggest a better approach if you have one.
> 
> [1]
> https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf
> [2]
> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/
> [3]
> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk
> 
> 
> On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <[email protected]> wrote:
> 
>> This is a good start. However, there is no need for the Adder component;
>> "Extracted Text (OCR)" can just feed back into the PDFBox "Text Extractor".
>> 
>> Maybe show a "PDF" file feeding into "Text Extractor", to make it clear
>> where the process starts.
>> 
>> -- John
>> 
>> On 26 Feb 2014, at 16:53, DImuthu Upeksha <[email protected]>
>> wrote:
>> 
>>> Sorry for the mistake. I added it to my Dropbox [1].
>>> 
>>> [1]
>>> https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
>>> 
>>> Thanks
>>> Dimuthu
>>> 
>>> 
>>> On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <[email protected]> wrote:
>>> 
>>>> I should add that the OCR engine should be pluggable, so PDFToText might
>>>> use an interface, e.g. OCREngine, and there will be a TesseractOCREngine
>>>> class somewhere which provides the required functionality and lives in a
>>>> separate jar file.
>>>> 
>>>> -- John
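
A minimal sketch of the pluggable design described above. The names OCREngine and TesseractOCREngine come from the message; the method signature, the StubOCREngine stand-in and the OcrTextExtractor consumer are assumptions made only for illustration, not PDFBox API.

```java
import java.awt.image.BufferedImage;

/** Plug-in point named in the thread; this method signature is an assumption. */
interface OCREngine {
    /** Returns the text recognised in the given image region. */
    String recognize(BufferedImage image);
}

/**
 * Hypothetical stand-in so the sketch runs without native code. A real
 * TesseractOCREngine would live in its own jar and call Tesseract instead.
 */
class StubOCREngine implements OCREngine {
    @Override
    public String recognize(BufferedImage image) {
        return "[" + image.getWidth() + "x" + image.getHeight() + " image]";
    }
}

/** Hypothetical consumer: it depends only on the interface, so engines can be swapped. */
class OcrTextExtractor {
    private final OCREngine engine;

    OcrTextExtractor(OCREngine engine) {
        this.engine = engine;
    }

    String extract(BufferedImage region) {
        return engine.recognize(region);
    }
}

public class OcrPluginDemo {
    public static void main(String[] args) {
        OcrTextExtractor extractor = new OcrTextExtractor(new StubOCREngine());
        BufferedImage region = new BufferedImage(120, 40, BufferedImage.TYPE_BYTE_GRAY);
        System.out.println(extractor.extract(region)); // prints "[120x40 image]"
    }
}
```

Because the extractor only sees the interface, the Tesseract-backed implementation and its native dependencies can stay out of the core jar.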
>>>> 
>>>>> On 25 Feb 2014, at 20:18, Dimuthu <[email protected]> wrote:
>>>>> 
>>>>> So do you need to embed the new functionality into the existing
>>>>> PDFToText algorithms, or package it as a new subsystem (something like
>>>>> an API)?
>>>>> 
>>>>> -----Original Message-----
>>>>> From: "John Hewson" <[email protected]>
>>>>> Sent: 26/02/2014 07:38
>>>>> To: "[email protected]" <[email protected]>
>>>>> Subject: Re: [GSoC 2014]Optical Character Recognition project - Introduction
>>>>> 
>>>>> Yes, exactly. By location data I just mean (x,y) coordinates and page
>>>>> rotation.
>>>>> 
>>>>> There is another use case for OCR: some fonts embedded in PDFs have
>>>>> corrupt encodings, which means the ASCII codes map to the wrong glyphs.
>>>>> We could OCR the glyphs to repair the encoding.
>>>>> 
>>>>> -- John
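
One way to picture the encoding-repair idea: build a table from each character code to the letter that OCR sees when that code's glyph is drawn, then re-map the extracted codes through it. This is only a sketch; the class, the method names and the hard-coded "OCR" callback are hypothetical stand-ins for rendering a glyph and recognising it.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntFunction;

public class EncodingRepairDemo {
    /** Builds a code -> character table by asking an OCR callback what each glyph looks like. */
    static Map<Integer, Character> buildRepairTable(int[] codes, IntFunction<Character> ocrGlyph) {
        Map<Integer, Character> table = new LinkedHashMap<>();
        for (int code : codes) {
            table.put(code, ocrGlyph.apply(code));
        }
        return table;
    }

    /** Re-maps the extracted codes through the repair table. */
    static String repair(int[] codes, Map<Integer, Character> table) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(table.get(code));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Codes as they might come out of a font with a corrupt encoding.
        int[] codes = {1, 2, 3, 3, 4};
        // Stand-in for "draw the glyph for this code, then OCR it".
        IntFunction<Character> fakeOcr = code -> "?Helo".charAt(code);
        Map<Integer, Character> table = buildRepairTable(codes, fakeOcr);
        System.out.println(repair(codes, table)); // prints "Hello"
    }
}
```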
>>>>> 
>>>>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <[email protected]> wrote:
>>>>>> 
>>>>>> Hi John,
>>>>>> Thanks for the explanation.
>>>>>> Let's say there is a PDF with both text in extractable format and some
>>>>>> images containing text (scanned images). In that case we first extract
>>>>>> the extractable content using the PDFBox algorithms, and the rest is
>>>>>> extracted using OCR. Finally we pack both results together and give
>>>>>> them as the output of PDFToText. Am I correct? What do you mean by
>>>>>> "location data"?
>>>>>> 
>>>>>> Thanks
>>>>>> Dimuthu
>>>>>> 
>>>>>> 
>>>>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <[email protected]> wrote:
>>>>>>> 
>>>>>>>> 1. What is called "glyphs"?
>>>>>>> 
>>>>>>> http://en.wikipedia.org/wiki/Glyph
>>>>>>> 
>>>>>>>> 2. What is the main requirement of this project?
>>>>>>>> As far as I understood, first we need to generate an image of
>>>>>>>> malformed PDFs from PDFBox and then we need to do processing using
>>>>>>>> OCR for more accurate results. But the problem is, why shouldn't we
>>>>>>>> do OCR directly on those PDFs without getting output from PDFBox?
>>>>>>>> Correct me if I'm wrong.
>>>>>>> 
>>>>>>> PDFBox can generate images (PDFToImage) and can extract text
>>>>>>> (PDFToText). The goal of this project is to enhance PDFToText so that
>>>>>>> it can use OCR to extract text from areas of the document where the
>>>>>>> text is embedded as an image. Such PDF files are typically generated
>>>>>>> by scanners or fax machines. There is also another case where OCR is
>>>>>>> useful: some fonts embedded in PDF files contain the wrong encoding,
>>>>>>> so when text is extracted with PDFToText the result is nonsense, but
>>>>>>> when drawn with PDFToImage we see the correct letters.
>>>>>>> 
>>>>>>> Instead of:
>>>>>>> PDF => Image => OCR => Text
>>>>>>> 
>>>>>>> We want to do:
>>>>>>> PDF => (Many images for words + location data => OCR) => Text
>>>>>>> 
>>>>>>> -- John
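
The second pipeline above can be sketched as merging two lists of text fragments by position: those extracted directly from the PDF and those recognised by OCR. Everything in this sketch (the Fragment type, the merge method, the sample data) is hypothetical, and the simple top-to-bottom, left-to-right ordering is an assumption that ignores page rotation and the harder reading-order cases.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class MergeByLocationDemo {
    /** A piece of text plus its position on the page (hypothetical type; y grows downwards here). */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Orders all fragments top-to-bottom, then left-to-right, and joins them with spaces. */
    static String merge(List<Fragment> embedded, List<Fragment> ocr) {
        List<Fragment> all = new ArrayList<>(embedded);
        all.addAll(ocr);
        all.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        StringBuilder sb = new StringBuilder();
        for (Fragment f : all) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Fragment> embedded = new ArrayList<>();
        embedded.add(new Fragment("Invoice", 10, 10));
        embedded.add(new Fragment("Total:", 10, 50));

        List<Fragment> ocr = new ArrayList<>();
        ocr.add(new Fragment("No. 42", 80, 10));
        ocr.add(new Fragment("19.99", 80, 50));

        System.out.println(merge(embedded, ocr)); // prints "Invoice No. 42 Total: 19.99"
    }
}
```

Keeping the location data with each OCR result is what lets the recognised words slot back in between the directly extracted ones.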
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <
>>>>>>> [email protected]
>>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> OK, fixed. This is what I did:
>>>>>>>>> Right-click on the new project -> Debug As -> Debug Configurations
>>>>>>>>> -> Source -> Add -> Project
>>>>>>>>> Then I selected the PDFBox project.
>>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> Dimuthu
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>> 
>>>>>>>>>> I'm using Eclipse. This is what I want. I created a new Java
>>>>>>>>>> application project (say TestPDFBox) with a main class containing
>>>>>>>>>> the following code.
>>>>>>>>>> 
>>>>>>>>>> PDDocument document = new PDDocument();
>>>>>>>>>> PDPage blankPage = new PDPage();
>>>>>>>>>> document.addPage( blankPage );
>>>>>>>>>> document.save("BlankPage.pdf");
>>>>>>>>>> document.close();
>>>>>>>>>> 
>>>>>>>>>> Then I need to add the jar files generated in the target folder of
>>>>>>>>>> PDFBox to the build path of my new project (I built the PDFBox
>>>>>>>>>> project from source). That is what I did. But let's say I need to
>>>>>>>>>> check the functionality of the document.save("") method. I don't
>>>>>>>>>> have a reference to its sources because I used the generated jars
>>>>>>>>>> directly. As Tilman said, I built PDFBox from sources, but I don't
>>>>>>>>>> know a proper way to use it in other projects other than adding
>>>>>>>>>> those jar files to the build path.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <[email protected]> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Which IDE are you using? You should be able to run the PDFToText
>>>>>>>>>>> class (in pdfbox-tools) using your IDE and pass a PDF file path as
>>>>>>>>>>> the command line argument.
>>>>>>>>>>> 
>>>>>>>>>>> -- John
>>>>>>>>>>> 
>>>>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <[email protected]> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi John,
>>>>>>>>>>>> Thanks for the reply. Yes, I checked out the PDFBox code and
>>>>>>>>>>>> managed to build it successfully. I looked at the classes you
>>>>>>>>>>>> mentioned and got a rough idea of how they work. To check them I
>>>>>>>>>>>> added the jars in the target folder to a separate Java project
>>>>>>>>>>>> and tried the samples in http://pdfbox.apache.org/cookbook/. I
>>>>>>>>>>>> need to look further into the code, especially how the
>>>>>>>>>>>> processXXX() methods work in the PDFTextStripper class. What I
>>>>>>>>>>>> usually do is add some breakpoints and check them in the debug
>>>>>>>>>>>> windows, but that's not possible using jars. How do you go about
>>>>>>>>>>>> such a task?
>>>>>>>>>>>> 
>>>>>>>>>>>> I also installed Tesseract on my machine and managed to do some
>>>>>>>>>>>> OCR. It's a cool tool which works fine.
>>>>>>>>>>>> I'm still learning the code. If I run into any issues I'll drop
>>>>>>>>>>>> you a mail.
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <[email protected]> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi Dimuthu
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/ and
>>>>>>>>>>>>> it contains a basic overview of the project and details on how to
>>>>>>>>>>>>> obtain the source code and build PDFBox for yourself.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Currently we do not perform any OCR, and PDFBOX-1912 details the
>>>>>>>>>>>>> only thoughts so far regarding it. Note that the OCR libraries
>>>>>>>>>>>>> mentioned in the JIRA issue are all under the Apache license,
>>>>>>>>>>>>> which is a requirement.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Once you have the source code, take a look at the PageDrawer
>>>>>>>>>>>>> class to see how text and images are rendered. We want someone to
>>>>>>>>>>>>> interface at a low level (e.g. one glyph, word, or sentence at a
>>>>>>>>>>>>> time) with an OCR engine. Also look at PDFTextStripper, which is
>>>>>>>>>>>>> how text is currently extracted; take a look at how we have to go
>>>>>>>>>>>>> to great lengths to sort text back into reading order and infer
>>>>>>>>>>>>> the placement of diacritics. PDF is fundamentally a visual
>>>>>>>>>>>>> format, not a structured format like HTML, which is why
>>>>>>>>>>>>> extracting text can be so difficult sometimes.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The full PDF Reference document can be found at:
>>>>>>>>>>>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Feel free to discuss specifics of your proposal or ask any
>>>>>>>>>>>>> questions.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -- John
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <[email protected]> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering undergraduate at the
>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka. I successfully completed GSoC
>>>>>>>>>>>>>> 2013 with the Apache ISIS [1] project. I'm very interested in OCR
>>>>>>>>>>>>>> and image processing, so I would like to select this project idea
>>>>>>>>>>>>>> as my GSoC 2014 project because I feel it is the project best
>>>>>>>>>>>>>> suited to me. At university we have also done some research in
>>>>>>>>>>>>>> the OCR area, and our group wrote a literature review about
>>>>>>>>>>>>>> increasing the efficiency of OCR systems (attached). Can you
>>>>>>>>>>>>>> please suggest where I should start learning about PDFBox?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka
