Re: [GSoC 2014]Optical Character Recognition project - Introduction

John Hewson Mon, 10 Mar 2014 01:31:26 -0700

Dimuthu,

That’s looking really good. You just need to use the following Apache header on 
your Java source files:


/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

-- John

On 7 Mar 2014, at 07:56, DImuthu Upeksha <[email protected]> wrote:

> Hi John
> I refactored the Tesseract JNI code [1] to support a Maven build. To create
> the JNI library I added pre-built static libraries of Tesseract and Leptonica
> to the resources folder [2]. For now it includes only the libraries for Mac,
> but we can easily add both Windows and Linux libraries. After "mvn clean
> install", the jar is created under the target folder. Now all the setup is
> done. What remains is implementing the native methods in tessbaseapi.cpp
> [3]. I hope to finish it as soon as possible. Please let me know if you have
> any concerns about the project structure.
> 
> [1] https://github.com/DImuthuUpe/Tesseract-API.git
> [2] https://github.com/DImuthuUpe/Tesseract-API/tree/master/src/main/resources
> [3] https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/main/native/src/tessbaseapi.cpp
> 
> Thanks
> Dimuthu
> 
> 
> On Thu, Mar 6, 2014 at 1:15 AM, John Hewson <[email protected]> wrote:
> 
>> Dimuthu
>> 
>>> There are a lot of code fragments in the current Android JNI wrapper
>>> which use "(jint)somePointer" casting, which will create terrible memory
>>> leaks in 64-bit environments because pointers are 64 bits wide. So I
>>> believe writing it from the beginning is much better.
>> 
>> That's a classic 64-bit pitfall, well spotted. We definitely need to
>> support
>> 64-bit JVMs.
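
The pitfall discussed above can be seen from the Java side alone. The following is a minimal sketch, not code from the Tesseract-API repository: the class name, method names and the example address are made up for illustration. It only shows that narrowing a 64-bit value to 32 bits (which is what a jint cast does to a pointer) loses the high bits, which is why native handles are usually carried in a Java long (JNI jlong) instead.

```java
// Hypothetical illustration; none of these names come from the wrapper code.
final class NativeHandleDemo {

    /** Mimics a (jint) cast on a 64-bit address: only the low 32 bits survive. */
    static int truncateToJint(long nativePointer) {
        return (int) nativePointer;
    }

    /** True only if the address still has the same value after the narrowing cast. */
    static boolean survivesRoundTrip(long nativePointer) {
        return (long) truncateToJint(nativePointer) == nativePointer;
    }

    public static void main(String[] args) {
        long address = 0x00007F8A12345678L; // made-up 64-bit address
        System.out.println("original:  0x" + Long.toHexString(address));
        System.out.println("truncated: 0x" + Integer.toHexString(truncateToJint(address)));
        System.out.println("round trip intact: " + survivesRoundTrip(address));
    }
}
```

Keeping the handle in a long field on the Java side and declaring the matching native parameters as jlong avoids the narrowing step entirely.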
>> 
>>> we can use
>>> the static library of Leptonica (I did and it worked nicely). I think it is
>>> not an issue to use its static library because both Tesseract and Leptonica
>>> are under the Apache licence.
>> 
>> Sounds good, I found the following in the README:
>> 
>> Leptonica is required. (www.leptonica.com). Tesseract no longer compiles
>> without Leptonica.
>> 
>> Which makes sense.
>> 
>> -- John
>> 
>> On 5 Mar 2014, at 09:45, DImuthu Upeksha <[email protected]>
>> wrote:
>> 
>>> Hi John,
>>> +1 for your suggestion about converting image <=> byte array on the Java
>>> side. It removes a lot of complexity. I don't know whether you have noticed,
>>> but the jint data type in JNI is a 32-bit integer type. I noticed it on my
>>> Mac but don't know about other operating systems.
>>> 
>>> Leptonica is the image processing library for Tesseract [1]. Tesseract
>>> uses the image processing algorithms in Leptonica to implement its OCR
>>> algorithms. This [2] is the .cpp file responsible for creating the Tesseract
>>> API. You can see it includes the allheaders.h header file, which is the main
>>> header file of Leptonica. So I think we must build Leptonica first and
>>> link it when we build Tesseract. This is not a big problem if we can use
>>> the static library of Leptonica (I did and it worked nicely). I think it is
>>> not an issue to use its static library because both Tesseract and Leptonica
>>> are under the Apache licence.
>>> 
>>> I'm working on the maven implementation you have mentioned and will get
>>> back to you soon.
>>> 
>>> Thanks
>>> Dimuthu
>>> 
>>> 
>>> [1] https://code.google.com/p/tesseract-ocr/wiki/Compiling
>>> [2] https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/src/api/tesseractmain.cpp
>>> 
>>> 
>>> On Wed, Mar 5, 2014 at 1:15 AM, John Hewson <[email protected]> wrote:
>>> 
>>>> Hi Dimuthu,
>>>> 
>>>> 1,2,3:
>>>> 
>>>> Feel free to write your own Tesseract binding or port the existing code
>> as
>>>> you see fit.
>>>> The JNI binding should be minimal, only the methods you require need to
>> be
>>>> wrapped.
>>>> Also, don't forget that some of the interop can be done in Java, for
>>>> example if it is easier
>>>> to convert a BufferedImage to a byte array in Java then do it there and
>>>> pass the result
>>>> to JNI rather than writing lots of JNI C++ to achieve the same result.
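
As a rough sketch of the Java-side conversion suggested above: the class and method names are hypothetical, and the choice of 8-bit grayscale with integer luminance weights is an assumption made for illustration, not something the thread specifies. The idea is only that the pixel data can be flattened to a plain byte array before it crosses the JNI boundary.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper; the thread does not define this class.
final class ImageBytes {

    /**
     * Flattens an image to 8-bit grayscale, one byte per pixel, row by row.
     * The 299/587/114 weights are a common luminance approximation (assumed here).
     */
    static byte[] toGrayBytes(BufferedImage source) {
        int width = source.getWidth();
        int height = source.getHeight();
        byte[] out = new byte[width * height];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = source.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                out[y * width + x] = (byte) ((r * 299 + g * 587 + b * 114) / 1000);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        BufferedImage image = new BufferedImage(4, 3, BufferedImage.TYPE_INT_RGB);
        image.setRGB(1, 2, 0xFFFFFF); // one white pixel on a black background
        byte[] pixels = toGrayBytes(image);
        System.out.println("bytes: " + pixels.length);
    }
}
```

The resulting array, together with the width and height, is all a native method would need to receive, so the C++ side does not have to know anything about BufferedImage.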
>>>> 
>>>> Your GitHub repo looks like a good start, I can make comments there as
>>>> things progress.
>>>> 
>>>> Is it possible to build Tesseract without leptonica? I was under the
>>>> impression that it was
>>>> used for image i/o only, but I may be misinformed.
>>>> 
>>>> 4:  The native platform library should be built as part of the Maven
>> build
>>>> for the Tesseract
>>>> wrapper which can be a separate project. The output can be a jar file
>>>> which contains the
>>>> native binaries. It should be possible for the jar to contain prebuilt
>>>> binaries for all platforms
>>>> but this is something we can worry about later. Right now the goal
>> should
>>>> be to build a jar
>>>> containing just the current platform's native binary and any Java
>> wrapper
>>>> code.
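
One way to picture a jar that carries per-platform binaries is a small helper that maps the running operating system to the conventional shared-library file name. This is only a sketch under assumptions: the class name, the base name "tessjni" and the detection rules are hypothetical and not taken from the project.

```java
import java.util.Locale;

// Hypothetical helper; the library base name "tessjni" is made up.
final class NativeLibraryNames {

    /** Maps an os.name value to the usual shared-library file name for that platform. */
    static String fileNameFor(String osName, String baseName) {
        String os = osName.toLowerCase(Locale.ROOT);
        // Check for Mac first: "darwin" contains "win", so order matters.
        if (os.startsWith("mac") || os.contains("darwin")) {
            return "lib" + baseName + ".dylib";
        }
        if (os.startsWith("windows")) {
            return baseName + ".dll";
        }
        // Treat everything else as a Linux-style platform.
        return "lib" + baseName + ".so";
    }

    public static void main(String[] args) {
        String os = System.getProperty("os.name");
        System.out.println(os + " -> " + fileNameFor(os, "tessjni"));
    }
}
```

A loader could use such a name to locate the matching binary inside the jar, copy it to a temporary file and pass that path to System.load; that part is left out here because it depends on how the jar is laid out.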
>>>> 
>>>> -- John
>>>> 
>>>> On 3 Mar 2014, at 16:41, DImuthu Upeksha <[email protected]>
>>>> wrote:
>>>> 
>>>>> Hi John,
>>>>> 
>>>>> I tried to reuse that Android JNI wrapper for Tesseract. Here are my
>>>>> observations:
>>>>> 
>>>>> 1. This wrapper depends heavily on Android image libraries
>>>>> (android/bitmap.h). Most of the wrapper methods [1] use this library.
>>>>> 
>>>>> 2. But I can understand the underlying logic in each function. Basically
>>>>> it maps Tesseract API functions [2] to Java methods. In between it does
>>>>> some image <=> byte array conversions using the Android bitmap libraries.
>>>>> 
>>>>> 3. There are two options. 1: We can port its code to make it compatible
>>>>> with our environments (Linux, Windows and Mac), which is really painful.
>>>>> It will also cause memory leaks. 2: We can use only its function
>>>>> signatures and implement them with our own code.
>>>>> 
>>>>> I think the 2nd solution is better because we need only a few operations
>>>>> from the Tesseract library. I have created a GitHub repo [3] for this.
>>>>> It's still not finished. I need to add some make files and build files to
>>>>> make it run properly. I also need to implement those wrapper functions
>>>>> [4]. This may take some time.
>>>>> 
>>>>> 4. Because we are calling native libraries we need different builds of
>>>>> the Tesseract and Leptonica libraries for each platform (.dll for Windows,
>>>>> .so for Linux, .dylib for Mac). So we may need to build those libraries
>>>>> when we build the PDFBox project. Or we can pre-build those libraries and
>>>>> add them to the project in .dll, .so or .dylib format. What is the
>>>>> preferred way?
>>>>> 
>>>>> [1] https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/tessbaseapi.cpp
>>>>> [2] https://code.google.com/p/tesseract-ocr/wiki/APIExample
>>>>> [3] https://github.com/DImuthuUpe/Tesseract-API
>>>>> [4] https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/tessbaseapi.cpp
>>>>> 
>>>>> Thanks
>>>>> Dimuthu
>>>>> 
>>>>> 
>>>>> On Sat, Mar 1, 2014 at 11:39 PM, DImuthu Upeksha <
>>>> [email protected]
>>>>>> wrote:
>>>>> 
>>>>>> I made the necessary changes to the document [1].
>>>>>> 
>>>>>> For the last two days I had a deep look at this [2] JNI wrapper for the
>>>>>> Tesseract API. Unfortunately it has been designed for the Android
>>>>>> environment, so I think we need to write our own make files to build it
>>>>>> into a dll (Windows) or dylib (Mac). Currently it has Android.mk files
>>>>>> [3]. I'm searching for a way to convert them to a make file that we can
>>>>>> run from the console. Please suggest a better approach if you have one.
>>>>>> 
>>>>>> [1] https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf
>>>>>> [2] https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/
>>>>>> [3] https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk
>>>>>> 
>>>>>> 
>>>>>> On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <[email protected]>
>> wrote:
>>>>>> 
>>>>>>> This is a good start. However, there is no need for the Adder component;
>>>>>>> "Extracted Text (OCR)" can just feed back into the PDFBox "Text
>>>>>>> Extractor".
>>>>>>> 
>>>>>>> Maybe show a "PDF" file feeding into "Text Extractor", to make it clear
>>>>>>> where the process starts.
>>>>>>> 
>>>>>>> -- John
>>>>>>> 
>>>>>>> On 26 Feb 2014, at 16:53, DImuthu Upeksha <
>> [email protected]>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Sorry for the mistake. I added it to my Dropbox [1].
>>>>>>>> 
>>>>>>>> [1] https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> Dimuthu
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <[email protected]>
>>>> wrote:
>>>>>>>> 
>>>>>>>>> I should add that the OCR engine should be pluggable, so PDFToText
>>>>>>>>> might use an interface, e.g. OCREngine, and there will be a
>>>>>>>>> TesseractOCREngine class somewhere which provides the required
>>>>>>>>> functionality and lives in a separate jar file.
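
The pluggable design described above could look roughly like the sketch below. Only the names OCREngine and TesseractOCREngine come from the message; the method signature, the stub engine and the extractor class are assumptions made for illustration, and the stub stands in for a real Tesseract-backed implementation that would live in its own jar.

```java
import java.awt.image.BufferedImage;

/** Assumed shape of the interface; the thread names it but does not define it. */
interface OCREngine {
    String recognize(BufferedImage image);
}

/** Hypothetical stand-in for TesseractOCREngine: returns canned text instead of calling native code. */
final class FixedTextOCREngine implements OCREngine {
    private final String text;

    FixedTextOCREngine(String text) {
        this.text = text;
    }

    @Override
    public String recognize(BufferedImage image) {
        return text;
    }
}

/** Hypothetical caller: depends only on the interface, so engines can be swapped. */
final class OcrTextExtractor {
    private final OCREngine engine;

    OcrTextExtractor(OCREngine engine) {
        this.engine = engine;
    }

    /** Runs the configured engine on one image region and trims the result. */
    String extract(BufferedImage region) {
        return engine.recognize(region).trim();
    }

    public static void main(String[] args) {
        OCREngine engine = new FixedTextOCREngine("  sample text  ");
        BufferedImage region = new BufferedImage(1, 1, BufferedImage.TYPE_BYTE_GRAY);
        System.out.println(new OcrTextExtractor(engine).extract(region));
    }
}
```

Because the extractor only sees the interface, the Tesseract dependency and its native binaries stay out of the core code path and can be added or removed by changing which jar is on the classpath.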
>>>>>>>>> 
>>>>>>>>> -- John
>>>>>>>>> 
>>>>>>>>>> On 25 Feb 2014, at 20:18, Dimuthu <[email protected]>
>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> So do you need to embed the new functionality into the existing
>>>>>>>>>> PDFToText algorithms or package it as a new subsystem (something
>>>>>>>>>> like an API)?
>>>>>>>>>> 
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: "John Hewson" <[email protected]>
>>>>>>>>>> Sent: 26/02/2014 07:38
>>>>>>>>>> To: "[email protected]" <[email protected]>
>>>>>>>>>> Subject: Re: [GSoC 2014]Optical Character Recognition project -
>>>>>>>>> Introduction
>>>>>>>>>> 
>>>>>>>>>> Yes, exactly. By location data I just mean (x,y) coordinates and page
>>>>>>>>>> rotation.
>>>>>>>>>> 
>>>>>>>>>> There is another use case for OCR: some fonts embedded in PDFs have
>>>>>>>>>> corrupt encodings, which means the ASCII codes map to the wrong
>>>>>>>>>> glyphs. We could OCR the glyphs to repair the encoding.
>>>>>>>>>> 
>>>>>>>>>> -- John
>>>>>>>>>> 
>>>>>>>>>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <
>>>>>>> [email protected]>
>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi John,
>>>>>>>>>>> Thanks for the explanation.
>>>>>>>>>>> Let's say there is a PDF with both text in an extractable format and
>>>>>>>>>>> some images with text (scanned images). In that case we first extract
>>>>>>>>>>> the extractable content using PDFBox algorithms and the rest is
>>>>>>>>>>> extracted using OCR. Finally we pack both results together and give
>>>>>>>>>>> the output as PDFToText. Am I correct? What do you mean by "location
>>>>>>>>>>> data"?
>>>>>>>>>>> 
>>>>>>>>>>> Thanks
>>>>>>>>>>> Dimuthu
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <
>> [email protected]>
>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> 1. What is called "glyphs" ?
>>>>>>>>>>>> 
>>>>>>>>>>>> http://en.wikipedia.org/wiki/Glyph
>>>>>>>>>>>> 
>>>>>>>>>>>>> 2. What is the main requirement of this project?
>>>>>>>>>>>>> As far as I understood, first we need to generate an image of
>>>>>>>>>>>>> malformed PDFs from PDFBox and then we need to process it using OCR
>>>>>>>>>>>>> for more accurate results. But why shouldn't we directly do OCR on
>>>>>>>>>>>>> those PDFs without getting output from PDFBox? Correct me if I'm
>>>>>>>>>>>>> wrong.
>>>>>>>>>>>> 
>>>>>>>>>>>> PDFBox can generate images (PDFToImage) and can extract text
>>>>>>>>>>>> (PDFToText). The goal of this project is to enhance PDFToText so that
>>>>>>>>>>>> it can use OCR to extract text from areas of the document where the
>>>>>>>>>>>> text is embedded as an image. Such PDF files are typically generated
>>>>>>>>>>>> by scanners or fax machines. There is also another case where OCR is
>>>>>>>>>>>> useful: some fonts embedded in PDF files contain the wrong encoding,
>>>>>>>>>>>> so when text is extracted with PDFToText the result is nonsense but
>>>>>>>>>>>> when drawn with PDFToImage we see the correct letters.
>>>>>>>>>>>> 
>>>>>>>>>>>> Instead of:
>>>>>>>>>>>> PDF => Image => OCR => Text
>>>>>>>>>>>> 
>>>>>>>>>>>> We want to do:
>>>>>>>>>>>> PDF => (Many images for words + location data => OCR) => Text
>>>>>>>>>>>> 
>>>>>>>>>>>> -- John
>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <
>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> OK, fixed. This is what I did:
>>>>>>>>>>>>>> Right click on the new project -> Debug As -> Debug Configurations
>>>>>>>>>>>>>> -> Source -> Add -> Project
>>>>>>>>>>>>>> Then I selected the PDFBox project.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I'm using Eclipse. This is what I want. I created a new Java
>>>>>>>>>>>>>>> application project (say TestPDFBox) with a main class containing
>>>>>>>>>>>>>>> the following code.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> PDDocument document = new PDDocument();
>>>>>>>>>>>>>>> PDPage blankPage = new PDPage();
>>>>>>>>>>>>>>> document.addPage(blankPage);
>>>>>>>>>>>>>>> document.save("BlankPage.pdf");
>>>>>>>>>>>>>>> document.close();
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Then I need to add the jar files generated in the target folder of
>>>>>>>>>>>>>>> PDFBox to the build path of my new project (I did build the PDFBox
>>>>>>>>>>>>>>> project from source). That is what I did. But let's say I need to
>>>>>>>>>>>>>>> check the functionality of the document.save("") method. I don't
>>>>>>>>>>>>>>> have a reference to its sources because I directly used the
>>>>>>>>>>>>>>> generated jars. As Tilman said, I built PDFBox from sources, but I
>>>>>>>>>>>>>>> don't know a proper way to use it in other projects other than
>>>>>>>>>>>>>>> adding those jar files to the build path.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <
>>>> [email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Which IDE are you using? You should be able to run the PDFToText
>>>>>>>>>>>>>>>> class (in pdfbox-tools) using your IDE and pass a PDF file path as
>>>>>>>>>>>>>>>> the command line argument.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <
>>>>>>>>>>>> [email protected]>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>> Thanks for the reply. Yes, I checked out the PDFBox code and managed
>>>>>>>>>>>>>>>>> to build it successfully. I looked at the classes you mentioned and
>>>>>>>>>>>>>>>>> got a rough idea of how they work. To check them I used the jars in
>>>>>>>>>>>>>>>>> the target folder in my separate Java project. I tried the samples at
>>>>>>>>>>>>>>>>> http://pdfbox.apache.org/cookbook/. I need to look further into the
>>>>>>>>>>>>>>>>> code, especially how those processXXX() methods work in the
>>>>>>>>>>>>>>>>> PDFTextStripper class. What I usually do is add some breakpoints and
>>>>>>>>>>>>>>>>> check them in the debug windows. But using jars that's not possible.
>>>>>>>>>>>>>>>>> What approach do you follow for such a task?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I also installed Tesseract on my machine and managed to do some OCR
>>>>>>>>>>>>>>>>> stuff. That's a cool tool which works fine.
>>>>>>>>>>>>>>>>> I'm still learning the code. If I run into any issue I'll drop you a
>>>>>>>>>>>>>>>>> mail.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <
>>>>>>> [email protected]
>>>>>>>>>> 
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hi Dimuthu
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/ and
>>>>>>>>>>>>>>>>>> contains a basic overview of the project and details on how to
>>>>>>>>>>>>>>>>>> obtain the source code and build PDFBox for yourself.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Currently we do not perform any OCR, and PDFBOX-1912 details the
>>>>>>>>>>>>>>>>>> only thoughts so far regarding it. Note that the OCR libraries
>>>>>>>>>>>>>>>>>> mentioned in the JIRA issue are all under the Apache license, which
>>>>>>>>>>>>>>>>>> is a requirement.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Once you have the source code, take a look at the PageDrawer class
>>>>>>>>>>>>>>>>>> to see how text and images are rendered. We want someone to
>>>>>>>>>>>>>>>>>> interface at a low level (e.g. one glyph, word, or sentence at a
>>>>>>>>>>>>>>>>>> time) with an OCR engine. Also look at PDFTextStripper, which is how
>>>>>>>>>>>>>>>>>> text is currently extracted; take a look at how we have to go to
>>>>>>>>>>>>>>>>>> great lengths to sort text back into reading order and infer the
>>>>>>>>>>>>>>>>>> placement of diacritics. PDF is fundamentally a visual format, not a
>>>>>>>>>>>>>>>>>> structured format like HTML, which is why extracting text can be so
>>>>>>>>>>>>>>>>>> difficult sometimes.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> The full PDF Reference document can be found at:
>>>>>>>>>>>>>>>>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Feel free to discuss specifics of your proposal or ask any
>>>>>>>>>>>> questions.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <
>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering undergraduate at the
>>>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka. I successfully completed GSoC
>>>>>>>>>>>>>>>>>>> 2013 with the Apache ISIS [1] project. I'm very much interested in
>>>>>>>>>>>>>>>>>>> OCR and image processing, so I would like to select this project
>>>>>>>>>>>>>>>>>>> idea as my GSoC 2014 project because I feel it is the best suited
>>>>>>>>>>>>>>>>>>> project for me. At university we have also done some research in
>>>>>>>>>>>>>>>>>>> the OCR area, and our group wrote a literature review about
>>>>>>>>>>>>>>>>>>> increasing the efficiency of OCR systems (attached). Can you please
>>>>>>>>>>>>>>>>>>> suggest where to start learning about PDFBox?
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> [1] http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka
