Re: [GSoC 2014]Optical Character Recognition project - Introduction

John Hewson Mon, 10 Mar 2014 01:34:28 -0700

>  You just need to use the following Apache header on your Java source files:


Actually, no, forget that. I don’t think you can use that header yet as you 
haven’t signed a CLA.
Leave the files as they are without headers for now. We’ll deal with the 
licensing later
because your code isn’t in the official Apache repository yet.

-- John

On 10 Mar 2014, at 01:30, John Hewson <j...@jahewson.com> wrote:

> Dimuthu,
> 
> That’s looking really good. You just need to use the following Apache header 
> on your Java source files:
> 
> /*
>  * Licensed to the Apache Software Foundation (ASF) under one or more
>  * contributor license agreements.  See the NOTICE file distributed with
>  * this work for additional information regarding copyright ownership.
>  * The ASF licenses this file to You under the Apache License, Version 2.0
>  * (the "License"); you may not use this file except in compliance with
>  * the License.  You may obtain a copy of the License at
>  *
>  *      http://www.apache.org/licenses/LICENSE-2.0
>  *
>  * Unless required by applicable law or agreed to in writing, software
>  * distributed under the License is distributed on an "AS IS" BASIS,
>  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>  * See the License for the specific language governing permissions and
>  * limitations under the License.
>  */
> 
> -- John
> 
> On 7 Mar 2014, at 07:56, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
> 
>> Hi John
>> I refactored the Tesseract JNI code [1] to support a Maven build. To create
>> the JNI library I added pre-built static libraries of Tesseract and Leptonica
>> to the resources folder [2]. For now it includes only libraries built for
>> Mac, but we can easily add both Windows and Linux libraries. After "mvn clean
>> install", the jar is created under the target folder. Now all the setup is
>> done. What remains is implementing the native methods in tessbaseapi.cpp
>> [3]. I hope to finish it ASAP. Please let me know if you have any concerns
>> about the project structure.
>> 
>> [1] https://github.com/DImuthuUpe/Tesseract-API.git
>> [2]
>> https://github.com/DImuthuUpe/Tesseract-API/tree/master/src/main/resources
>> [3]
>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/main/native/src/tessbaseapi.cpp
>> 
>> Thanks
>> Dimuthu
>> 
>> 
>> On Thu, Mar 6, 2014 at 1:15 AM, John Hewson <j...@jahewson.com> wrote:
>> 
>>> Dimuthu
>>> 
>>>> There are a lot of code
>>>> fragments in the current Android JNI wrapper which use "(jint)somePointer"
>>>> casting, which will create terrible memory leaks in 64-bit environments
>>>> because pointers are 64 bits wide. So I believe writing it from the
>>>> beginning is much better.
>>> 
>>> That's a classic 64-bit pitfall, well spotted. We definitely need to support
>>> 64-bit JVMs.
>>> 
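To illustrate the pitfall discussed above, here is a minimal Java sketch (not code from the wrapper) of what happens when a 64-bit address is squeezed through a 32-bit handle. The class name, method names and the sample address are made up for illustration. On the native side, the usual remedy would be to pass the pointer as a jlong rather than a jint.

```java
final class NativeHandleDemo {
    /** Simulates the risky pattern: a 64-bit address stored in a 32-bit int handle. */
    static long roundTripThroughInt(long address) {
        int handle = (int) address; // the upper 32 bits are discarded here
        return handle;              // widened back to 64 bits, but the lost bits stay lost
    }

    /** The safe pattern: keep the address in a 64-bit long handle. */
    static long roundTripThroughLong(long address) {
        long handle = address;
        return handle;
    }

    public static void main(String[] args) {
        long address = 0x00007F8A12345678L; // hypothetical pointer value, for illustration only
        System.out.printf("original: 0x%016X%n", address);
        System.out.printf("via int : 0x%016X%n", roundTripThroughInt(address));
        System.out.printf("via long: 0x%016X%n", roundTripThroughLong(address));
    }
}
```

A handle that has lost its upper bits no longer refers to the original native object, so anything later done with it on the native side is undefined.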
>>>> we can use
>>>> the static library of Leptonica (I did and it worked nicely). I think it is
>>>> not an issue to use its static library because both Tesseract and Leptonica
>>>> are under the Apache licence.
>>> 
>>> Sounds good, I found the following in the README:
>>> 
>>> Leptonica is required. (www.leptonica.com). Tesseract no longer compiles
>>> without Leptonica.
>>> 
>>> Which makes sense.
>>> 
>>> -- John
>>> 
>>> On 5 Mar 2014, at 09:45, DImuthu Upeksha <dimuthu.upeks...@gmail.com>
>>> wrote:
>>> 
>>>> Hi John,
>>>> +1 for your suggestion about converting image <=> byte array on the Java
>>>> side. It removes a lot of complexity. I don't know whether you have noticed
>>>> or not, but the jint data type in JNI is a 32-bit integer type. I noticed it
>>>> on my Mac but don't know about other operating systems.
>>>> 
>>>> Leptonica is the image processing library for Tesseract [1]. Tesseract
>>>> uses the image processing algorithms in Leptonica to implement its OCR
>>>> algorithms. This [2] is the .cpp file responsible for creating the Tesseract
>>>> API. You can see it includes the allheaders.h header file, which is the main
>>>> header file of Leptonica. So I think it is a must to build Leptonica first
>>>> and link it when we build Tesseract. This is not a big problem if we can use
>>>> the static library of Leptonica (I did and it worked nicely). I think it is
>>>> not an issue to use its static library because both Tesseract and Leptonica
>>>> are under the Apache licence.
>>>> 
>>>> I'm working on the maven implementation you have mentioned and will get
>>>> back to you soon.
>>>> 
>>>> Thanks
>>>> Dimuthu
>>>> 
>>>> 
>>>> [1] https://code.google.com/p/tesseract-ocr/wiki/Compiling
>>>> [2] https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/src/api/tesseractmain.cpp
>>>> 
>>>> 
>>>> On Wed, Mar 5, 2014 at 1:15 AM, John Hewson <j...@jahewson.com> wrote:
>>>> 
>>>>> Hi Dimuthu,
>>>>> 
>>>>> 1,2,3:
>>>>> 
>>>>> Feel free to write your own Tesseract binding or port the existing code as
>>>>> you see fit. The JNI binding should be minimal; only the methods you require
>>>>> need to be wrapped. Also, don't forget that some of the interop can be done
>>>>> in Java. For example, if it is easier to convert a BufferedImage to a byte
>>>>> array in Java, then do it there and pass the result to JNI rather than
>>>>> writing lots of JNI C++ to achieve the same result.
>>>>> 
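As a rough sketch of doing the image conversion on the Java side: the helper below turns any BufferedImage into one grayscale byte per pixel, row by row, which is the kind of flat array that is easy to hand to a native method. The class and method names, the grayscale layout and the luminance weights are assumptions made for this example, not anything agreed in this thread.

```java
import java.awt.image.BufferedImage;

final class ImageBytes {
    /**
     * Converts an image to 8-bit grayscale, one byte per pixel, stored row by row.
     * The result has exactly width * height entries.
     */
    static byte[] toGrayBytes(BufferedImage image) {
        int width = image.getWidth();
        int height = image.getHeight();
        byte[] out = new byte[width * height];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = image.getRGB(x, y); // packed as 0xAARRGGBB
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of the common 0.299/0.587/0.114 luminance weights.
                int luminance = (r * 299 + g * 587 + b * 114) / 1000;
                out[y * width + x] = (byte) luminance;
            }
        }
        return out;
    }
}
```

With the pixels in this form, the native method only needs the array plus the width and height, so no image handling has to be written in C++.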
>>>>> Your GitHub repo looks like a good start, I can make comments there as
>>>>> things progress.
>>>>> 
>>>>> Is it possible to build Tesseract without leptonica? I was under the
>>>>> impression that it was
>>>>> used for image i/o only, but I may be misinformed.
>>>>> 
>>>>> 4: The native platform library should be built as part of the Maven build
>>>>> for the Tesseract wrapper, which can be a separate project. The output can
>>>>> be a jar file which contains the native binaries. It should be possible for
>>>>> the jar to contain prebuilt binaries for all platforms, but this is
>>>>> something we can worry about later. Right now the goal should be to build a
>>>>> jar containing just the current platform's native binary and any Java
>>>>> wrapper code.
>>>>> 
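If the jar does eventually carry binaries for several platforms, the Java wrapper has to pick the right file at run time. Below is a minimal sketch of that choice, assuming the usual naming conventions (.dll on Windows, lib*.so on Linux, lib*.dylib on Mac). The class name and the library base name "tessapi" are hypothetical; for the running platform the JDK's System.mapLibraryName performs a similar mapping.

```java
import java.util.Locale;

final class NativeLibraryNames {
    /**
     * Returns the conventional native library file name for an operating system,
     * given the value of the "os.name" system property and a base library name.
     */
    static String fileNameFor(String osName, String baseName) {
        String os = osName.toLowerCase(Locale.ROOT);
        // Check Mac first: "darwin" contains "win" and must not be taken for Windows.
        if (os.contains("mac") || os.contains("darwin")) {
            return "lib" + baseName + ".dylib";
        }
        if (os.contains("win")) {
            return baseName + ".dll";
        }
        // Treat everything else as a Linux-style platform.
        return "lib" + baseName + ".so";
    }

    public static void main(String[] args) {
        // "tessapi" is a made-up base name used only for this illustration.
        System.out.println(fileNameFor(System.getProperty("os.name"), "tessapi"));
    }
}
```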
>>>>> -- John
>>>>> 
>>>>> On 3 Mar 2014, at 16:41, DImuthu Upeksha <dimuthu.upeks...@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> Hi John,
>>>>>> 
>>>>>> I tried to reuse that Android JNI wrapper for Tesseract. Here are my
>>>>>> observations:
>>>>>> 
>>>>>> 1. This wrapper depends heavily on Android image libraries
>>>>>> (android/bitmap.h). Most of the wrapper methods [1] use this library.
>>>>>> 
>>>>>> 2. But I can understand the underlying logic in each function. Basically,
>>>>>> it maps Tesseract API functions [2] to Java methods. In between, it does
>>>>>> some image <=> byte array conversions using the Android bitmap libraries.
>>>>>> 
>>>>>> 3. There are two ways. 1: We can port its code to make it compatible with
>>>>>> our environments (Linux, Windows and Mac), which is really painful. Also it
>>>>>> will cause memory leaks. 2: We can use only its function signatures and
>>>>>> implement them using our own code.
>>>>>> 
>>>>>> I think the 2nd solution is better because we need only a few operations to
>>>>>> be done using the Tesseract library. I have created a GitHub repo [3] for
>>>>>> this. It's still not finished. I need to add some make files and build
>>>>>> files to make it run properly. I also need to implement those wrapper
>>>>>> functions [4]. This may take some time.
>>>>>> 
>>>>>> 4. Because we are calling native libraries, we need different builds of
>>>>>> the Tesseract and Leptonica libraries for each platform (dll for Windows,
>>>>>> so for Linux, dylib for Mac). So we may need to build those libraries at
>>>>>> the time we build the PDFBox project. Or we can pre-build those libraries
>>>>>> and add them to the project in .dll, .so or .dylib format. What is the
>>>>>> preferred way?
>>>>>> 
>>>>>> [1] https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/tessbaseapi.cpp
>>>>>> [2] https://code.google.com/p/tesseract-ocr/wiki/APIExample
>>>>>> [3] https://github.com/DImuthuUpe/Tesseract-API
>>>>>> [4] https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/tessbaseapi.cpp
>>>>>> 
>>>>>> Thanks
>>>>>> Dimuthu
>>>>>> 
>>>>>> 
>>>>>> On Sat, Mar 1, 2014 at 11:39 PM, DImuthu Upeksha <
>>>>> dimuthu.upeks...@gmail.com
>>>>>>> wrote:
>>>>>> 
>>>>>>> I made the necessary changes to the document [1].
>>>>>>> 
>>>>>>> For the last two days I had a deep look at this [2] JNI wrapper for the
>>>>>>> Tesseract API. Unfortunately it has been designed for the Android
>>>>>>> environment, so I think we need to write our own make files to build it
>>>>>>> into a dll (Windows) or dylib (Mac). Currently it has Android.mk files
>>>>>>> [3]. I'm searching for a way to convert them to a make file that we can
>>>>>>> run from the console. Please suggest a better approach if you have one.
>>>>>>> 
>>>>>>> [1] https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf
>>>>>>> [2] https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/
>>>>>>> [3] https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk
>>>>>>> 
>>>>>>> 
>>>>>>> On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <j...@jahewson.com>
>>> wrote:
>>>>>>> 
>>>>>>>> This is a good start. However, there is no need for the Adder component;
>>>>>>>> "Extracted Text (OCR)" can just feed back into the PDFBox "Text
>>>>>>>> Extractor".
>>>>>>>> 
>>>>>>>> Maybe show a "PDF" file feeding into "Text Extractor", to make it clear
>>>>>>>> where the process starts.
>>>>>>>> 
>>>>>>>> -- John
>>>>>>>> 
>>>>>>>> On 26 Feb 2014, at 16:53, DImuthu Upeksha <
>>> dimuthu.upeks...@gmail.com>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Sorry for the mistake. I added it to my Dropbox [1].
>>>>>>>>> 
>>>>>>>>> [1] https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
>>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> Dimuthu
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <j...@jahewson.com>
>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> I should add that the OCR engine should be pluggable, so PDFToText might
>>>>>>>>>> use an interface, e.g. OCREngine, and there will be a TesseractOCREngine
>>>>>>>>>> class somewhere which provides the required functionality and lives in a
>>>>>>>>>> separate jar file.
>>>>>>>>>> 
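A minimal sketch of what such a pluggable design could look like. Only the name OCREngine comes from the message above; the method signature, the stand-in engine and the ImageTextExtractor class are assumptions invented for this example. A real TesseractOCREngine would implement the same interface and live in its own jar.

```java
import java.util.Objects;

/** The pluggable contract: turn a grayscale image region into text. */
interface OCREngine {
    String recognize(byte[] grayPixels, int width, int height);
}

/** Stand-in engine for illustration; a Tesseract-backed engine would go here instead. */
final class FixedTextEngine implements OCREngine {
    private final String text;

    FixedTextEngine(String text) {
        this.text = text;
    }

    @Override
    public String recognize(byte[] grayPixels, int width, int height) {
        return text; // performs no real recognition
    }
}

/** Depends only on the interface, so the engine can be swapped without code changes. */
final class ImageTextExtractor {
    private final OCREngine engine;

    ImageTextExtractor(OCREngine engine) {
        this.engine = Objects.requireNonNull(engine, "engine");
    }

    String extract(byte[] grayPixels, int width, int height) {
        if (grayPixels.length != width * height) {
            throw new IllegalArgumentException("expected one byte per pixel");
        }
        return engine.recognize(grayPixels, width, height);
    }
}
```

Because the extractor sees only the interface, the text extraction code never needs a compile-time dependency on the native wrapper.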
>>>>>>>>>> -- John
>>>>>>>>>> 
>>>>>>>>>>> On 25 Feb 2014, at 20:18, Dimuthu <dimuthu.upeks...@gmail.com>
>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> So do you need to embed those new functionalities into the existing
>>>>>>>>>>> PDFToText algorithms or package them as a new subsystem (something
>>>>>>>>>>> like an API)?
>>>>>>>>>>> 
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: "John Hewson" <j...@jahewson.com>
>>>>>>>>>>> Sent: 26/02/2014 07:38
>>>>>>>>>>> To: "dev@pdfbox.apache.org" <dev@pdfbox.apache.org>
>>>>>>>>>>> Subject: Re: [GSoC 2014]Optical Character Recognition project -
>>>>>>>>>> Introduction
>>>>>>>>>>> 
>>>>>>>>>>> Yes, exactly. By location data I just mean (x,y) coordinates and page
>>>>>>>>>>> rotation.
>>>>>>>>>>> 
>>>>>>>>>>> There is another use case for OCR: some fonts embedded in PDFs have
>>>>>>>>>>> corrupt encodings, which means the ASCII codes map to the wrong glyphs.
>>>>>>>>>>> We could OCR the glyphs to repair the encoding.
>>>>>>>>>>> 
>>>>>>>>>>> -- John
>>>>>>>>>>> 
>>>>>>>>>>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <
>>>>>>>> dimuthu.upeks...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi John,
>>>>>>>>>>>> Thanks for the explanation.
>>>>>>>>>>>> Let's say there is a PDF with both text in extractable format and some
>>>>>>>>>>>> images with text (scanned images). In that case we first extract the
>>>>>>>>>>>> extractable content using PDFBox algorithms and the rest is extracted
>>>>>>>>>>>> using OCR. Finally we pack both results together and give the output as
>>>>>>>>>>>> PDFToText. Am I correct? What do you mean by "location data"?
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <
>>> j...@jahewson.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 1. What is called "glyphs" ?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> http://en.wikipedia.org/wiki/Glyph
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 2. What is the main requirement of this project?
>>>>>>>>>>>>>> As far as I understood, first we need to generate an image of
>>>>>>>>>>>>>> malformed PDFs from PDFBox and then process it using OCR for more
>>>>>>>>>>>>>> accurate results. But the problem is, why shouldn't we do OCR directly
>>>>>>>>>>>>>> on those PDFs without getting output from PDFBox? Correct me if I'm
>>>>>>>>>>>>>> wrong.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> PDFBox can generate images (PDFToImage) and can extract text (PDFToText).
>>>>>>>>>>>>> The goal of this project is to enhance PDFToText so that it can use OCR to
>>>>>>>>>>>>> extract text from areas of the document where the text is embedded as an
>>>>>>>>>>>>> image. Such PDF files are typically generated by scanners or fax machines.
>>>>>>>>>>>>> There is also another case where OCR is useful: some fonts embedded in PDF
>>>>>>>>>>>>> files contain the wrong encoding, so when text is extracted with PDFToText
>>>>>>>>>>>>> the result is nonsense, but when drawn with PDFToImage we see the correct
>>>>>>>>>>>>> letters.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Instead of:
>>>>>>>>>>>>> PDF => Image => OCR => Text
>>>>>>>>>>>>> 
>>>>>>>>>>>>> We want to do:
>>>>>>>>>>>>> PDF => (Many images for words + location data => OCR) => Text
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -- John
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <
>>>>>>>>>>>>> dimuthu.upeks...@gmail.com
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Ok, fixed. This is what I did:
>>>>>>>>>>>>>>> Right click on the new project -> Debug As -> Debug Configurations
>>>>>>>>>>>>>>> -> Source -> Add -> Project
>>>>>>>>>>>>>>> Then I selected the PDFBox project.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <
>>>>>>>>>>>>>>> dimuthu.upeks...@gmail.com> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I'm using Eclipse. This is what I want. I created a new Java application
>>>>>>>>>>>>>>>> project (say TestPDFBox) with a main class containing the following code.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> PDDocument document = new PDDocument();
>>>>>>>>>>>>>>>> PDPage blankPage = new PDPage();
>>>>>>>>>>>>>>>> document.addPage(blankPage);
>>>>>>>>>>>>>>>> document.save("BlankPage.pdf");
>>>>>>>>>>>>>>>> document.close();
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Then I need to add the jar files generated in the target folder of
>>>>>>>>>>>>>>>> PDFBox to the build path of my new project (I did build the PDFBox
>>>>>>>>>>>>>>>> project from source). That is what I did. But let's say I need to
>>>>>>>>>>>>>>>> check the functionality of the document.save("") method. I don't have
>>>>>>>>>>>>>>>> a reference to its sources because I used the generated jars directly.
>>>>>>>>>>>>>>>> As Tilman said, I built PDFBox from sources, but I don't know a proper
>>>>>>>>>>>>>>>> way to use it in other projects other than adding those jar files to
>>>>>>>>>>>>>>>> the build path.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <
>>>>> j...@jahewson.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Which IDE are you using? You should be able to run the
>>>>> PDFToText
>>>>>>>>>> class
>>>>>>>>>>>>>>>>> (in pdfbox-tools) using your IDE and pass a PDF file path as
>>>>> the
>>>>>>>>>>>>> command
>>>>>>>>>>>>>>>>> line argument.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <
>>>>>>>>>>>>> dimuthu.upeks...@gmail.com>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>> Thanks for the reply. Yes, I checked out the PDFBox code and managed to
>>>>>>>>>>>>>>>>>> build it successfully. I looked at the classes you mentioned and I got a
>>>>>>>>>>>>>>>>>> rough idea of how they work. To check them I added the jars in the
>>>>>>>>>>>>>>>>>> target folder to a separate Java project. I tried the samples in
>>>>>>>>>>>>>>>>>> http://pdfbox.apache.org/cookbook/. I need to look further into the code,
>>>>>>>>>>>>>>>>>> especially how the processXXX() methods work in the PDFTextStripper class.
>>>>>>>>>>>>>>>>>> What I usually do is add some breakpoints and inspect them in the debug
>>>>>>>>>>>>>>>>>> windows. But using jars it's not possible. What approach do you follow
>>>>>>>>>>>>>>>>>> for such a task?
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> I also installed Tesseract on my machine and managed to do some OCR
>>>>>>>>>>>>>>>>>> with it. That's a cool tool which works fine.
>>>>>>>>>>>>>>>>>> I'm still learning the code. If I run into any issue I'll drop you a mail.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <
>>>>>>>> j...@jahewson.com
>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Hi Dimuthu
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/. It contains
>>>>>>>>>>>>>>>>>>> a basic overview of the project and details on how to obtain the source
>>>>>>>>>>>>>>>>>>> code and build PDFBox for yourself.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Currently we do not perform any OCR, and PDFBOX-1912 details the only
>>>>>>>>>>>>>>>>>>> thoughts so far regarding it. Note that the OCR libraries mentioned in
>>>>>>>>>>>>>>>>>>> the JIRA issue are all under the Apache license, which is a requirement.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Once you have the source code, take a look at the PageDrawer class to see
>>>>>>>>>>>>>>>>>>> how text and images are rendered. We want someone to interface at a low
>>>>>>>>>>>>>>>>>>> level (e.g. one glyph, word, or sentence at a time) with an OCR engine.
>>>>>>>>>>>>>>>>>>> Also look at PDFTextStripper, which is how text is currently extracted, and
>>>>>>>>>>>>>>>>>>> at how we have to go to great lengths to sort text back into reading order
>>>>>>>>>>>>>>>>>>> and infer the placement of diacritics. PDF is fundamentally a visual
>>>>>>>>>>>>>>>>>>> format, not a structured format like HTML, which is why extracting text
>>>>>>>>>>>>>>>>>>> can be so difficult sometimes.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> The full PDF Reference document can be found at:
>>>>>>>>>>>>>>>>>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Feel free to discuss specifics of your proposal or ask any
>>>>>>>>>>>>> questions.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <
>>>>>>>>>>>>> dimuthu.upeks...@gmail.com
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering undergraduate at the
>>>>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka. I successfully completed my GSoC 2013
>>>>>>>>>>>>>>>>>>>> project with Apache ISIS [1]. I'm very much interested in OCR and image
>>>>>>>>>>>>>>>>>>>> processing, so I would like to select this project idea as my GSoC 2014
>>>>>>>>>>>>>>>>>>>> project because I feel it is the best-suited project for me. At
>>>>>>>>>>>>>>>>>>>> university we have also done some research in the OCR area, and our group
>>>>>>>>>>>>>>>>>>>> wrote a literature review about increasing the efficiency of OCR systems
>>>>>>>>>>>>>>>>>>>> (attached). Can you please suggest where to start learning about PDFBox?
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> [1] http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka
