Re: [GSoC 2014]Optical Character Recognition project - Introduction

DImuthu Upeksha Sun, 16 Mar 2014 06:16:23 -0700

Hi John,

For now I'm using those methods to debug the wrapper. I'll remove
them once I have finished testing it.


I have started implementing an OCR plugin [1] for PDFBox. It currently
satisfies the basic requirements, such as getting word and location
data [2]. Please have a look and let me know if any changes are
required.

[1] https://github.com/DImuthuUpe/OCR-Plugin
[2] https://github.com/DImuthuUpe/OCR-Plugin/blob/master/src/main/java/org/apache/pdfbox/ocr/OCRConnector.java

Thanks
Dimuthu

On Fri, Mar 14, 2014 at 12:09 AM, John Hewson <j...@jahewson.com> wrote:
> Thanks, I saw your new refactoring too, it's good. Now the following methods 
> are no longer needed:
>
> public void setImagePath(String path)
> public void setImage(byte[] imagedata, int width, int height, int bpp,int bpl)
>
> Cheers
>
> -- John
>
> On 11 Mar 2014, at 22:58, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>
>> Hi John,
>> Yes. I implemented a new method that accepts the image as a byte stream.
>> We can't send BufferedImage objects directly to the native side, so I
>> convert the BufferedImage into a byte array and pass that to the native
>> side, where it is converted again into a compatible format. Along with
>> the byte array we need to pass some metadata: the image width, height,
>> bytes per pixel and bytes per row. I checked it with this [2] test case
>> and it works fine.
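
The conversion described in the paragraph above can be sketched in plain Java. This is only a minimal sketch, not the code in the linked repository: the RawImage class and its fromBufferedImage method are hypothetical names, and reducing the image to 8-bit grayscale is an assumption made here to get a predictable one-byte-per-pixel layout. The fields mirror the width, height, bpp and bpl parameters of the setImage signature quoted in this thread.

```java
import java.awt.image.BufferedImage;

// Hypothetical holder for raw pixels plus the metadata the native side needs.
final class RawImage {
    final byte[] data;
    final int width;
    final int height;
    final int bytesPerPixel;
    final int bytesPerLine;

    RawImage(byte[] data, int width, int height, int bytesPerPixel, int bytesPerLine) {
        this.data = data;
        this.width = width;
        this.height = height;
        this.bytesPerPixel = bytesPerPixel;
        this.bytesPerLine = bytesPerLine;
    }

    // Assumption: 8-bit grayscale is acceptable input for the OCR engine.
    static RawImage fromBufferedImage(BufferedImage source) {
        int width = source.getWidth();
        int height = source.getHeight();
        int bytesPerPixel = 1;
        int bytesPerLine = width * bytesPerPixel;
        byte[] pixels = new byte[bytesPerLine * height];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = source.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer luma approximation; the weights sum to 1000.
                int gray = (r * 299 + g * 587 + b * 114) / 1000;
                pixels[y * bytesPerLine + x] = (byte) gray;
            }
        }
        return new RawImage(pixels, width, height, bytesPerPixel, bytesPerLine);
    }
}
```

Doing the conversion in Java keeps the JNI layer down to a single call that receives a byte array and four integers.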
>>
>> [1] https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/main/java/com/apache/pdfbox/ocr/tesseract/TessBaseAPI.java#L74
>> [2] https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/test/java/com/apache/pdfbox/ocr/tesseract/TessByteSteamTest.java
>>
>> Thanks
>> Dimuthu
>>
>> On Wed, Mar 12, 2014 at 12:40 AM, John Hewson <j...@jahewson.com> wrote:
>>> Hi Dimuthu
>>>
>>> The Tesseract wrapper needs to take its input from a BufferedImage rather 
>>> than reading a file from disk, so instead of:
>>>
>>> api.setImagePath("test.tif");
>>>
>>> What we need is:
>>>
>>> BufferedImage image = ImageIO.read(new File("test.tif"));
>>> api.setImagePath(image);
>>>
>>> This will let us use the BufferedImage generated by PDFRenderer
>>> without round-tripping to disk.
>>>
>>> -- John
>>>
>>> On 11 Mar 2014, at 11:13, DImuthu Upeksha <dimuthu.upeks...@gmail.com> 
>>> wrote:
>>>
>>>> Hi John,
>>>> Thanks for the guidance.
>>>> I did a small analysis of the accuracy and performance of the new
>>>> Tesseract wrapper. I used this [1] image as the input and got the
>>>> following data [2] after OCR. The first line is the recognised word,
>>>> followed by the location details (bounding box) of the word. I think
>>>> these details are sufficient for our task. What remains is converting
>>>> the PDF file into an image, as you mentioned. I'm working on that
>>>> these days.
>>>>
>>>> [1] https://www.dropbox.com/s/11wahtonoz08zmn/image4.TIF
>>>> [2] https://gist.github.com/DImuthuUpe/9491660
>>>>
>>>> Thanks
>>>> Dimuthu
>>>>
>>>> On Mon, Mar 10, 2014 at 2:30 PM, John Hewson <j...@jahewson.com> wrote:
>>>>> Dimuthu,
>>>>>
>>>>>> I finished a basic implementation of the JNI wrapper for Tesseract. It
>>>>>> can now be built using Maven. Some useful methods needed for basic OCR
>>>>>> have been implemented.
>>>>>
>>>>> Great, it's looking good, nice and clean.
>>>>>
>>>>>> 1. What is the task of processStream method in PDFTextStripper class line
>>>>>> 456 : processStream( page.findResources(), content, page.findCropBox(),
>>>>>> page.findRotation() );
>>>>>
>>>>> A PDF file is made up of pages, each of which contains a "content 
>>>>> stream". This content stream contains a list of drawing commands such as 
>>>>> "move to 10,15" or "write the word `foo`", these are called operators. 
>>>>> The processStream function reads the stream for the current page and 
>>>>> executes each of the operators. The operators themselves are implemented 
>>>>> each in their own class which is a subclass of PDFOperator. The 
>>>>> constructor of PDFStreamEngine creates the operator classes using 
>>>>> reflection, which is rather odd and I'm not sure why this design was 
>>>>> chosen. The operators used by PDFTextStripper can be found in 
>>>>> org/apache/pdfbox/resources/PDFTextStripper.properties
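
The operator mechanism described above can be illustrated with a toy interpreter. This is a sketch under stated assumptions, not PDFBox code: ToyStreamEngine, ToyOperator, MoveTo and ShowText are hypothetical names, the operators are registered directly rather than through reflection, and tokens are split on whitespace, which a real PDF parser does not do. It only shows the idea of operands accumulating until an operator name is seen, at which point the matching operator class is looked up and executed. The names "m" (move to) and "Tj" (show text) are real PDF operators.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an operator class.
interface ToyOperator {
    void process(List<String> operands, List<String> log);
}

final class MoveTo implements ToyOperator {
    public void process(List<String> operands, List<String> log) {
        log.add("move to " + operands.get(0) + "," + operands.get(1));
    }
}

final class ShowText implements ToyOperator {
    public void process(List<String> operands, List<String> log) {
        String literal = operands.get(0);
        // Strip the surrounding parentheses of a PDF string literal.
        log.add("show text " + literal.substring(1, literal.length() - 1));
    }
}

final class ToyStreamEngine {
    private final Map<String, ToyOperator> operators = new HashMap<String, ToyOperator>();
    final List<String> log = new ArrayList<String>();

    ToyStreamEngine() {
        operators.put("m", new MoveTo());
        operators.put("Tj", new ShowText());
    }

    // Operands come first; the operator name that follows consumes them.
    // Any token that is not a registered operator is treated as an operand.
    void processStream(String content) {
        List<String> operands = new ArrayList<String>();
        for (String token : content.trim().split("\\s+")) {
            ToyOperator operator = operators.get(token);
            if (operator == null) {
                operands.add(token);
            } else {
                operator.process(operands, log);
                operands = new ArrayList<String>();
            }
        }
    }
}
```

Calling processStream("10 15 m (foo) Tj") on this toy engine records a "move to" step followed by a "show text" step.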
>>>>>
>>>>>> 2. Say I need to extract images and their metadata from a PDF. What is
>>>>>> the best approach?
>>>>>
>>>>> You could subclass PDFTextStripper and override the startDocument method 
>>>>> and use it to create a PDFRenderer and store it in a field. Then override 
>>>>> the processPage method and use the previously created PDFRenderer to 
>>>>> render the current page to a buffered image and perform OCR on the image. 
>>>>> Once you have the OCR text + positions, instead of calling processStream 
>>>>> you can call processTextPosition once for each character + position.
>>>>>
>>>>> The PDFRenderer class was just added to the trunk, so make sure you do an 
>>>>> "svn update". Let me know if you need me to change PDFTextStripper to 
>>>>> make it easier to subclass.
>>>>>
>>>>> Cheers
>>>>>
>>>>> -- John
>>>>>
>>>>> On 9 Mar 2014, at 09:08, DImuthu Upeksha <dimuthu.upeks...@gmail.com> 
>>>>> wrote:
>>>>>
>>>>>> Hi John,
>>>>>> I finished a basic implementation of the JNI wrapper for Tesseract. It
>>>>>> can now be built using Maven. Some useful methods needed for basic OCR
>>>>>> have been implemented.
>>>>>>
>>>>>> I went through the PDFBox code several times and have a couple of
>>>>>> questions that need clarifying:
>>>>>>
>>>>>> 1. What is the task of processStream method in PDFTextStripper class line
>>>>>> 456 : processStream( page.findResources(), content, page.findCropBox(),
>>>>>> page.findRotation() );
>>>>>>
>>>>>> 2. Say I need to extract images and their metadata from a PDF. What is
>>>>>> the best approach?
>>>>>>
>>>>>> Thanks
>>>>>> Dimuthu
>>>>>>
>>>>>>
>>>>>> On Fri, Mar 7, 2014 at 9:26 PM, DImuthu Upeksha
>>>>>> <dimuthu.upeks...@gmail.com>wrote:
>>>>>>
>>>>>>> Hi John
>>>>>>> I refactored the Tesseract JNI code to support a Maven build. To create
>>>>>>> the JNI library I added pre-built static libraries of Tesseract and
>>>>>>> Leptonica to the resources folder [2]. For now it includes only
>>>>>>> libraries built for Mac, but we can easily add both Windows and Linux
>>>>>>> libraries. After "mvn clean install", the jar is created under the
>>>>>>> target folder. All the setup is now done. What remains is implementing
>>>>>>> the native methods in tessbaseapi.cpp [3]. I hope to finish it as soon
>>>>>>> as possible. Please let me know if you have any concerns about the
>>>>>>> project structure.
>>>>>>>
>>>>>>> [1] https://github.com/DImuthuUpe/Tesseract-API.git
>>>>>>> [2] https://github.com/DImuthuUpe/Tesseract-API/tree/master/src/main/resources
>>>>>>> [3] https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/main/native/src/tessbaseapi.cpp
>>>>>>>
>>>>>>> Thanks
>>>>>>> Dimuthu
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Mar 6, 2014 at 1:15 AM, John Hewson <j...@jahewson.com> wrote:
>>>>>>>
>>>>>>>> Dimuthu
>>>>>>>>
>>>>>>>>> There are a lot of code fragments in the current Android JNI wrapper
>>>>>>>>> that use "(jint)somePointer" casts, which will create terrible memory
>>>>>>>>> leaks in 64-bit environments because pointers are 64 bits wide. So I
>>>>>>>>> believe writing it from scratch is much better.
>>>>>>>>
>>>>>>>> That's a classic 64-bit pitfall, well spotted. We definitely need to
>>>>>>>> support 64-bit JVMs.
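
The pitfall can be reproduced in pure Java, because a JNI jint corresponds to a 32-bit Java int and a jlong to a 64-bit Java long. The sketch below is illustrative only: PointerTruncation and the sample address are made up, and no native code is involved. It shows that narrowing a 64-bit value to 32 bits and widening it again does not give back the original address, which is why native handles are normally kept in a long.

```java
// Hypothetical demonstration class; the handle value is an arbitrary example.
final class PointerTruncation {

    // Mimics storing a native pointer in a jint and reading it back later.
    static long roundTripThroughInt(long nativeHandle) {
        int narrowed = (int) nativeHandle; // upper 32 bits are discarded here
        return narrowed;                   // widened (sign-extended) back to 64 bits
    }

    // Storing the handle in a long (jlong on the native side) loses nothing.
    static long roundTripThroughLong(long nativeHandle) {
        long kept = nativeHandle;
        return kept;
    }

    public static void main(String[] args) {
        long handle = 0x00007F3A12345678L;
        System.out.println(Long.toHexString(roundTripThroughInt(handle)));
        System.out.println(Long.toHexString(roundTripThroughLong(handle)));
    }
}
```

A handle that happens to fit in 32 bits survives the round trip, which is why this kind of bug can go unnoticed on 32-bit systems.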
>>>>>>>>
>>>>>>>>> we can use
>>>>>>>>> the static library of Leptonica (I did and it worked nicely). I think
>>>>>>>>> it is not an issue to use its static library because both Tesseract
>>>>>>>>> and Leptonica are under the Apache licence.
>>>>>>>>
>>>>>>>> Sounds good, I found the following in the README:
>>>>>>>>
>>>>>>>> Leptonica is required. (www.leptonica.com). Tesseract no longer
>>>>>>>> compiles without Leptonica.
>>>>>>>>
>>>>>>>> Which makes sense.
>>>>>>>>
>>>>>>>> -- John
>>>>>>>>
>>>>>>>> On 5 Mar 2014, at 09:45, DImuthu Upeksha <dimuthu.upeks...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi John,
>>>>>>>>> +1 for your suggestion about converting image <=> byte array on the
>>>>>>>>> Java side. It removes a lot of complexity. I don't know whether you
>>>>>>>>> have noticed, but the jint data type in JNI is a 32-bit integer type.
>>>>>>>>> I noticed it on my Mac but don't know about other operating systems.
>>>>>>>>>
>>>>>>>>> Leptonica is the image processing library for Tesseract [1]. Tesseract
>>>>>>>>> uses the image processing algorithms in Leptonica to implement its OCR
>>>>>>>>> algorithms. This [2] is the .cpp file responsible for creating the
>>>>>>>>> Tesseract API. You can see it includes allheaders.h, which is the main
>>>>>>>>> header file of Leptonica. So I think we must build Leptonica first and
>>>>>>>>> link it when we build Tesseract. This is not a big problem if we can
>>>>>>>>> use the static library of Leptonica (I did and it worked nicely). I
>>>>>>>>> think it is not an issue to use its static library because both
>>>>>>>>> Tesseract and Leptonica are under the Apache licence.
>>>>>>>>>
>>>>>>>>> I'm working on the Maven implementation you mentioned and will get
>>>>>>>>> back to you soon.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Dimuthu
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> [1] https://code.google.com/p/tesseract-ocr/wiki/Compiling
>>>>>>>>> [2] https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/src/api/tesseractmain.cpp
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Mar 5, 2014 at 1:15 AM, John Hewson <j...@jahewson.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Dimuthu,
>>>>>>>>>>
>>>>>>>>>> 1,2,3:
>>>>>>>>>>
>>>>>>>>>> Feel free to write your own Tesseract binding or port the existing
>>>>>>>>>> code as you see fit. The JNI binding should be minimal: only the
>>>>>>>>>> methods you require need to be wrapped. Also, don't forget that some
>>>>>>>>>> of the interop can be done in Java. For example, if it is easier to
>>>>>>>>>> convert a BufferedImage to a byte array in Java, then do it there and
>>>>>>>>>> pass the result to JNI rather than writing lots of JNI C++ to achieve
>>>>>>>>>> the same result.
>>>>>>>>>>
>>>>>>>>>> Your GitHub repo looks like a good start, I can make comments there
>>>>>>>>>> as things progress.
>>>>>>>>>>
>>>>>>>>>> Is it possible to build Tesseract without Leptonica? I was under the
>>>>>>>>>> impression that it was used for image I/O only, but I may be
>>>>>>>>>> misinformed.
>>>>>>>>>>
>>>>>>>>>> 4: The native platform library should be built as part of the Maven
>>>>>>>>>> build for the Tesseract wrapper, which can be a separate project. The
>>>>>>>>>> output can be a jar file which contains the native binaries. It
>>>>>>>>>> should be possible for the jar to contain prebuilt binaries for all
>>>>>>>>>> platforms, but this is something we can worry about later. Right now
>>>>>>>>>> the goal should be to build a jar containing just the current
>>>>>>>>>> platform's native binary and any Java wrapper code.
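
One way a Java wrapper might work out which bundled binary belongs to the current platform is sketched below. Everything here is an assumption made for illustration: NativeLibraryNames is a hypothetical class, the base name passed in is arbitrary, and the mapping only covers the three cases discussed in this thread (dll for Windows, so for Linux, dylib for Mac). Real projects often need to distinguish CPU architecture as well, which this sketch ignores.

```java
import java.util.Locale;

// Hypothetical helper that maps an operating system name to a library file name.
final class NativeLibraryNames {

    static String fileNameFor(String osName, String baseName) {
        String os = osName.toLowerCase(Locale.ROOT);
        // Check Mac first: "darwin" also contains the substring "win".
        if (os.contains("mac") || os.contains("darwin")) {
            return "lib" + baseName + ".dylib";
        }
        if (os.contains("win")) {
            return baseName + ".dll";
        }
        // Fall back to the Linux/Unix naming convention.
        return "lib" + baseName + ".so";
    }

    // Uses the standard "os.name" system property of the running JVM.
    static String forCurrentPlatform(String baseName) {
        return fileNameFor(System.getProperty("os.name"), baseName);
    }
}
```

A wrapper could use such a name to locate the matching resource inside the jar before loading it, so a single jar could later carry binaries for several platforms.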
>>>>>>>>>>
>>>>>>>>>> -- John
>>>>>>>>>>
>>>>>>>>>> On 3 Mar 2014, at 16:41, DImuthu Upeksha <dimuthu.upeks...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi John,
>>>>>>>>>>>
>>>>>>>>>>> I tried to reuse that Android JNI wrapper for Tesseract. Here are my
>>>>>>>>>>> observations:
>>>>>>>>>>>
>>>>>>>>>>> 1. This wrapper depends heavily on Android image libraries
>>>>>>>>>>> (android/bitmap.h). Most of the wrapper methods [1] use this library.
>>>>>>>>>>>
>>>>>>>>>>> 2. But I can understand the underlying logic in each function.
>>>>>>>>>>> Basically it maps Tesseract API functions [2] to Java methods. In
>>>>>>>>>>> between it does some image <=> byte array conversions using the
>>>>>>>>>>> Android bitmap libraries.
>>>>>>>>>>>
>>>>>>>>>>> 3. There are two ways. 1: We can port its code to make it compatible
>>>>>>>>>>> with our environments (Linux, Windows and Mac), which is really
>>>>>>>>>>> painful. It will also cause memory leaks. 2: We can use only its
>>>>>>>>>>> function signatures and implement them with our own code.
>>>>>>>>>>>
>>>>>>>>>>> I think the 2nd solution is better because we need only a few
>>>>>>>>>>> operations from the Tesseract library. I have created a GitHub repo
>>>>>>>>>>> [3] for this. It's still not finished. I need to add some make files
>>>>>>>>>>> and build files to make it run properly. I also need to implement
>>>>>>>>>>> those wrapper functions [4]. This may take some time.
>>>>>>>>>>>
>>>>>>>>>>> 4. Because we are calling native libraries, we need different builds
>>>>>>>>>>> of the Tesseract and Leptonica libraries for each platform (dll for
>>>>>>>>>>> Windows, so for Linux, dylib for Mac). So we may need to build those
>>>>>>>>>>> libraries when we build the PDFBox project. Or we can pre-build those
>>>>>>>>>>> libraries and add them to the project in .dll, .so or .dylib format.
>>>>>>>>>>> Which way is preferred?
>>>>>>>>>>>
>>>>>>>>>>> [1] https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/tessbaseapi.cpp
>>>>>>>>>>> [2] https://code.google.com/p/tesseract-ocr/wiki/APIExample
>>>>>>>>>>> [3] https://github.com/DImuthuUpe/Tesseract-API
>>>>>>>>>>> [4] https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/tessbaseapi.cpp
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> Dimuthu
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Mar 1, 2014 at 11:39 PM, DImuthu Upeksha <
>>>>>>>>>> dimuthu.upeks...@gmail.com
>>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I made the necessary changes to the document [1].
>>>>>>>>>>>>
>>>>>>>>>>>> For the last two days I had a deep look at this [2] JNI wrapper for
>>>>>>>>>>>> the Tesseract API. Unfortunately it has been designed for the
>>>>>>>>>>>> Android environment, so I think we need to write our own make files
>>>>>>>>>>>> to build it into a dll (Windows) or dylib (Mac). Currently it has
>>>>>>>>>>>> Android.mk files [3]. I'm searching for a way to convert them into
>>>>>>>>>>>> a make file that we can run from the console. Please suggest a
>>>>>>>>>>>> better approach if you have one.
>>>>>>>>>>>>
>>>>>>>>>>>> [1] https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf
>>>>>>>>>>>> [2] https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/
>>>>>>>>>>>> [3] https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <j...@jahewson.com>
>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> This is a good start. However, there is no need for the Adder
>>>>>>>>>>>>> component: "Extracted Text (OCR)" can just feed back into the
>>>>>>>>>>>>> PDFBox "Text Extractor".
>>>>>>>>>>>>>
>>>>>>>>>>>>> Maybe show a "PDF" file feeding into "Text Extractor", to make it
>>>>>>>>>>>>> clear where the process starts.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 26 Feb 2014, at 16:53, DImuthu Upeksha <
>>>>>>>> dimuthu.upeks...@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Sorry for the mistake. I added it to my Dropbox [1].
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1] https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <j...@jahewson.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I should add that the OCR engine should be pluggable, so PDFToText
>>>>>>>>>>>>>>> might use an interface, e.g. OCREngine, and there will be a
>>>>>>>>>>>>>>> TesseractOCREngine class somewhere which provides the required
>>>>>>>>>>>>>>> functionality and lives in a separate jar file.
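
The pluggable design suggested above could look roughly like this. It is a sketch, not an agreed API: only the interface name OCREngine comes from this thread, while the recognize method, the OcrWord class, the FixedOCREngine stub and the OcrTextExtractor caller are hypothetical. The stub stands in for a TesseractOCREngine, which would live in a separate jar and call the native wrapper.

```java
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical result type: one recognised word and its bounding box.
final class OcrWord {
    final String text;
    final int x, y, width, height;

    OcrWord(String text, int x, int y, int width, int height) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }
}

// Interface name taken from the thread; the method signature is an assumption.
interface OCREngine {
    List<OcrWord> recognize(BufferedImage image);
}

// Stub engine returning canned results, used here in place of a real engine.
final class FixedOCREngine implements OCREngine {
    public List<OcrWord> recognize(BufferedImage image) {
        List<OcrWord> words = new ArrayList<OcrWord>();
        words.add(new OcrWord("hello", 10, 10, 40, 12));
        words.add(new OcrWord("world", 55, 10, 42, 12));
        return words;
    }
}

// The caller depends only on the interface, so engines can be swapped freely.
final class OcrTextExtractor {
    private final OCREngine engine;

    OcrTextExtractor(OCREngine engine) {
        this.engine = engine;
    }

    String extract(BufferedImage page) {
        StringBuilder text = new StringBuilder();
        for (OcrWord word : engine.recognize(page)) {
            if (text.length() > 0) {
                text.append(' ');
            }
            text.append(word.text);
        }
        return text.toString();
    }
}
```

Because the extractor only sees the interface, the core code would not need a compile-time dependency on any particular OCR library.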
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 25 Feb 2014, at 20:18, Dimuthu <dimuthu.upeks...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So do you need to embed those new functionalities into the existing
>>>>>>>>>>>>>>>> PDFToText algorithms or package them as a new subsystem (something
>>>>>>>>>>>>>>>> like an API)?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>> From: "John Hewson" <j...@jahewson.com>
>>>>>>>>>>>>>>>> Sent: 26/02/2014 07:38
>>>>>>>>>>>>>>>> To: "dev@pdfbox.apache.org" <dev@pdfbox.apache.org>
>>>>>>>>>>>>>>>> Subject: Re: [GSoC 2014]Optical Character Recognition project -
>>>>>>>>>>>>>>> Introduction
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yes, exactly. By location data I just mean (x,y) coordinates and
>>>>>>>>>>>>>>>> page rotation.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> There is another use case for OCR: some fonts embedded in PDFs have
>>>>>>>>>>>>>>>> corrupt encodings, which means the ASCII codes map to the wrong
>>>>>>>>>>>>>>>> glyphs. We could OCR the glyphs to repair the encoding.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <
>>>>>>>>>>>>> dimuthu.upeks...@gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>> Thanks for the explanation.
>>>>>>>>>>>>>>>>> Let's say there is a PDF with both text in an extractable format and
>>>>>>>>>>>>>>>>> some images containing text (scanned images). In that case we first
>>>>>>>>>>>>>>>>> extract the extractable content using PDFBox algorithms, and the
>>>>>>>>>>>>>>>>> rest is extracted using OCR. Finally we combine both results and
>>>>>>>>>>>>>>>>> give the output as PDFToText. Am I correct? What do you mean by
>>>>>>>>>>>>>>>>> "location data"?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <
>>>>>>>> j...@jahewson.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 1. What is called "glyphs" ?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> http://en.wikipedia.org/wiki/Glyph
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 2. What is the main requirement of this project?
>>>>>>>>>>>>>>>>>>> As far as I understood, we first need to generate an image of
>>>>>>>>>>>>>>>>>>> malformed PDFs with PDFBox and then process it using OCR for more
>>>>>>>>>>>>>>>>>>> accurate results. But why shouldn't we do OCR directly on those
>>>>>>>>>>>>>>>>>>> PDFs without getting output from PDFBox? Correct me if I'm wrong.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> PDFBox can generate images (PDFToImage) and can extract text
>>>>>>>>>>>>>>>>>> (PDFToText). The goal of this project is to enhance PDFToText so
>>>>>>>>>>>>>>>>>> that it can use OCR to extract text from areas of the document
>>>>>>>>>>>>>>>>>> where the text is embedded as an image. Such PDF files are
>>>>>>>>>>>>>>>>>> typically generated by scanners or fax machines. There is also
>>>>>>>>>>>>>>>>>> another case where OCR is useful: some fonts embedded in PDF files
>>>>>>>>>>>>>>>>>> contain the wrong encoding, so when text is extracted with
>>>>>>>>>>>>>>>>>> PDFToText the result is nonsense, but when drawn with PDFToImage
>>>>>>>>>>>>>>>>>> we see the correct letters.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Instead of:
>>>>>>>>>>>>>>>>>> PDF => Image => OCR => Text
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> We want to do:
>>>>>>>>>>>>>>>>>> PDF => (Many images for words + location data => OCR) => Text
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <
>>>>>>>>>>>>>>>>>> dimuthu.upeks...@gmail.com
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Ok fixed. This is what I did:
>>>>>>>>>>>>>>>>>>>> Right click on the new project -> Debug As -> Debug Configurations
>>>>>>>>>>>>>>>>>>>> -> Source -> Add -> Project
>>>>>>>>>>>>>>>>>>>> Then I selected the PDFBox project.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <
>>>>>>>>>>>>>>>>>>>> dimuthu.upeks...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I'm using Eclipse. This is what I want. I created a new Java
>>>>>>>>>>>>>>>>>>>>> application project (say TestPDFBox) with a main class containing
>>>>>>>>>>>>>>>>>>>>> the following code.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> PDDocument document = new PDDocument();
>>>>>>>>>>>>>>>>>>>>> PDPage blankPage = new PDPage();
>>>>>>>>>>>>>>>>>>>>> document.addPage( blankPage );
>>>>>>>>>>>>>>>>>>>>> document.save("BlankPage.pdf");
>>>>>>>>>>>>>>>>>>>>> document.close();
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Then I need to add the jar files generated in the target folder of
>>>>>>>>>>>>>>>>>>>>> PDFBox to the build path of my new project (I built the PDFBox
>>>>>>>>>>>>>>>>>>>>> project from source). That is what I did. But let's say I need to
>>>>>>>>>>>>>>>>>>>>> check the functionality of the document.save("") method. I don't
>>>>>>>>>>>>>>>>>>>>> have a reference to its sources because I used the generated jars
>>>>>>>>>>>>>>>>>>>>> directly. As Tilman said, I built PDFBox from sources, but I don't
>>>>>>>>>>>>>>>>>>>>> know a proper way to use it in other projects other than adding
>>>>>>>>>>>>>>>>>>>>> those jar files to the build path.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <
>>>>>>>>>> j...@jahewson.com>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Which IDE are you using? You should be able to run the PDFToText
>>>>>>>>>>>>>>>>>>>>>> class (in pdfbox-tools) using your IDE and pass a PDF file path as
>>>>>>>>>>>>>>>>>>>>>> the command line argument.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <
>>>>>>>>>>>>>>>>>> dimuthu.upeks...@gmail.com>
>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>>>>>>> Thanks for the reply. Yes, I checked out the PDFBox code and
>>>>>>>>>>>>>>>>>>>>>>> managed to build it successfully. I looked at the classes you
>>>>>>>>>>>>>>>>>>>>>>> mentioned and got a rough idea of how they work. To check them I
>>>>>>>>>>>>>>>>>>>>>>> added the jars in the target folder to my separate Java project.
>>>>>>>>>>>>>>>>>>>>>>> I tried the samples in http://pdfbox.apache.org/cookbook/. I need
>>>>>>>>>>>>>>>>>>>>>>> to look further into the code, especially how those processXXX()
>>>>>>>>>>>>>>>>>>>>>>> methods work in the PDFTextStripper class. What I usually do is
>>>>>>>>>>>>>>>>>>>>>>> add some breakpoints and check them in the debug windows. But
>>>>>>>>>>>>>>>>>>>>>>> that's not possible using jars. How do you go about such a task?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I also installed Tesseract on my machine and managed to do some
>>>>>>>>>>>>>>>>>>>>>>> OCR. That's a cool tool which works fine. I'm still learning the
>>>>>>>>>>>>>>>>>>>>>>> code. If I run into any issues I'll drop you a mail.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <
>>>>>>>>>>>>> j...@jahewson.com
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hi Dimuthu
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/. It
>>>>>>>>>>>>>>>>>>>>>>>> contains a basic overview of the project and details on how to
>>>>>>>>>>>>>>>>>>>>>>>> obtain the source code and build PDFBox for yourself.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Currently we do not perform any OCR, and PDFBOX-1912 details the
>>>>>>>>>>>>>>>>>>>>>>>> only thoughts so far regarding it. Note that the OCR libraries
>>>>>>>>>>>>>>>>>>>>>>>> mentioned in the JIRA issue are all under the Apache license,
>>>>>>>>>>>>>>>>>>>>>>>> which is a requirement.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Once you have the source code, take a look at the PageDrawer
>>>>>>>>>>>>>>>>>>>>>>>> class to see how text and images are rendered. We want someone
>>>>>>>>>>>>>>>>>>>>>>>> to interface at a low level (e.g. one glyph, word, or sentence
>>>>>>>>>>>>>>>>>>>>>>>> at a time) with an OCR engine. Also look at PDFTextStripper,
>>>>>>>>>>>>>>>>>>>>>>>> which is how text is currently extracted, and at how we have to
>>>>>>>>>>>>>>>>>>>>>>>> go to great lengths to sort text back into reading order and
>>>>>>>>>>>>>>>>>>>>>>>> infer the placement of diacritics. PDF is fundamentally a visual
>>>>>>>>>>>>>>>>>>>>>>>> format, not a structured format like HTML, which is why
>>>>>>>>>>>>>>>>>>>>>>>> extracting text can be so difficult sometimes.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> The full PDF Reference document can be found at:
>>>>>>>>>>>>>>>>>>>>>>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Feel free to discuss specifics of your proposal or ask
>>>>>>>> any
>>>>>>>>>>>>>>>>>> questions.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <
>>>>>>>>>>>>>>>>>> dimuthu.upeks...@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering undergraduate at
>>>>>>>>>>>>>>>>>>>>>>>>> the University of Moratuwa, Sri Lanka. I successfully completed
>>>>>>>>>>>>>>>>>>>>>>>>> GSoC 2013 with the Apache ISIS [1] project. I'm very interested
>>>>>>>>>>>>>>>>>>>>>>>>> in OCR and image processing, so I would like to select this
>>>>>>>>>>>>>>>>>>>>>>>>> project idea as my GSoC 2014 project because I feel it is the
>>>>>>>>>>>>>>>>>>>>>>>>> project best suited to me. At university we have also done some
>>>>>>>>>>>>>>>>>>>>>>>>> research in the OCR area, and our group wrote a literature
>>>>>>>>>>>>>>>>>>>>>>>>> review about increasing the efficiency of OCR systems
>>>>>>>>>>>>>>>>>>>>>>>>> (attached). Can you please suggest where I should start
>>>>>>>>>>>>>>>>>>>>>>>>> learning about PDFBox?
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> [1] http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka



-- 
Regards

W.Dimuthu Upeksha
Undergraduate

Department of Computer Science And Engineering

University of Moratuwa, Sri Lanka
