Re: [GSoC 2014]Optical Character Recognition project - Introduction

DImuthu Upeksha Tue, 11 Mar 2014 11:23:22 -0700

Hi John,
Thanks for the guidance.
I did a small analysis of the accuracy and performance of the new
Tesseract wrapper. I used this image [1] as the input and got the
following data [2] after OCR. The first line is the recognised word,
followed by the location details (bounding box) of the word. I think these
details are pretty much enough for our task. What remains now is
converting the PDF file into an image, as you mentioned. I'm working on
it these days.


[1] https://www.dropbox.com/s/11wahtonoz08zmn/image4.TIF
[2] https://gist.github.com/DImuthuUpe/9491660

Thanks
Dimuthu

On Mon, Mar 10, 2014 at 2:30 PM, John Hewson <[email protected]> wrote:
> Dimuthu,
>
>> I finished a basic implementation of the JNI wrapper for Tesseract. Now it can be
>> built using Maven. Some useful methods that are needed to do basic OCR were
>> implemented.
>
> Great, it's looking good, nice and clean.
>
>> 1. What is the task of processStream method in PDFTextStripper class line
>> 456 : processStream( page.findResources(), content, page.findCropBox(),
>> page.findRotation() );
>
> A PDF file is made up of pages, each of which contains a "content stream". 
> This content stream contains a list of drawing commands such as "move to 
> 10,15" or "write the word `foo`", these are called operators. The 
> processStream function reads the stream for the current page and executes 
> each of the operators. The operators themselves are implemented each in their 
> own class which is a subclass of PDFOperator. The constructor of 
> PDFStreamEngine creates the operator classes using reflection, which is 
> rather odd and I'm not sure why this design was chosen. The operators used by 
> PDFTextStripper can be found in 
> org/apache/pdfbox/resources/PDFTextStripper.properties
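
A toy sketch of the dispatch described above, in plain Java. It is not PDFBox code: ToyStreamEngine, its handler map and the whitespace tokenizer are hypothetical stand-ins for PDFStreamEngine and its reflection-loaded operator classes. Only the operator names "m" (move to) and "Tj" (show text) come from the PDF specification. In a content stream the operands come first and the operator last, so the loop collects operands until it meets a registered operator name.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Toy model of a content-stream interpreter; NOT the PDFBox implementation.
final class ToyStreamEngine {
    // Hypothetical registry: operator name -> handler receiving the operands.
    private final Map<String, Consumer<List<String>>> operators = new HashMap<>();
    final List<String> log = new ArrayList<>();

    ToyStreamEngine() {
        // "m" (move to) and "Tj" (show text) are real PDF operator names.
        operators.put("m", args -> log.add("moveTo " + args.get(0) + "," + args.get(1)));
        operators.put("Tj", args -> log.add("showText " + args.get(0)));
    }

    // Simplified tokenizer: assumes whitespace-separated tokens with no spaces
    // inside strings, which real content streams do not guarantee.
    void processStream(String content) {
        List<String> operands = new ArrayList<>();
        for (String token : content.trim().split("\\s+")) {
            Consumer<List<String>> handler = operators.get(token);
            if (handler != null) {
                handler.accept(operands);      // execute the operator
                operands = new ArrayList<>();  // start collecting for the next one
            } else {
                operands.add(token);           // operand for the next operator
            }
        }
    }

    public static void main(String[] args) {
        ToyStreamEngine engine = new ToyStreamEngine();
        engine.processStream("10 15 m (foo) Tj");
        engine.log.forEach(System.out::println);
    }
}
```

Unknown tokens are simply treated as operands here; a real engine has to parse PDF objects properly and decide what to do with operators it has no handler for.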
>
>> 2. Say I need to extract images and their metadata from a PDF. What is the
>> best approach to do it?
>
> You could subclass PDFTextStripper and override the startDocument method and 
> use it to create a PDFRenderer and store it in a field. Then override the 
> processPage method and use the previously created PDFRenderer to render the 
> current page to a buffered image and perform OCR on the image. Once you have 
> the OCR text + positions, instead of calling processStream you can call 
> processTextPosition once for each character + position.
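
The OCR data so far gives one bounding box per word, while the suggestion above feeds one character + position at a time. Below is a minimal sketch of one way to bridge that gap, assuming every character gets an equal share of the word's width (a crude approximation for proportional fonts). WordSplitter, CharBox and splitWord are hypothetical names, not PDFBox or Tesseract API; in a real subclass each result would presumably be converted to a text position and handed to processTextPosition.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits one OCR word box into per-character boxes.
final class WordSplitter {
    // Simple value holder for one character and its estimated box.
    static final class CharBox {
        final char ch;
        final double x, y, width, height;

        CharBox(char ch, double x, double y, double width, double height) {
            this.ch = ch;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    // Assumption: every character in the word is equally wide.
    static List<CharBox> splitWord(String word, double x, double y,
                                   double width, double height) {
        List<CharBox> boxes = new ArrayList<>();
        if (word.isEmpty()) {
            return boxes;
        }
        double charWidth = width / word.length();
        for (int i = 0; i < word.length(); i++) {
            boxes.add(new CharBox(word.charAt(i), x + i * charWidth, y, charWidth, height));
        }
        return boxes;
    }

    public static void main(String[] args) {
        for (CharBox b : splitWord("foo", 100, 50, 30, 12)) {
            System.out.println(b.ch + " at x=" + b.x + " width=" + b.width);
        }
    }
}
```

If the OCR engine can report character-level boxes itself, those would be preferable to this even split.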
>
> The PDFRenderer class was just added to the trunk, so make sure you do an 
> "svn update". Let me know if you need me to change PDFTextStripper to make it 
> easier to subclass.
>
> Cheers
>
> -- John
>
> On 9 Mar 2014, at 09:08, DImuthu Upeksha <[email protected]> wrote:
>
>> Hi John,
>> I finished basic implementation of JNI wrapper for Tesseract. Now it can be
>> build using maven. Some useful methods that are needed to do basic OCR were
>> implemented.
>>
>> I went through PDFBox code several times and got couple of issues that are
>> needed to be clarified
>>
>> 1. What is the task of processStream method in PDFTextStripper class line
>> 456 : processStream( page.findResources(), content, page.findCropBox(),
>> page.findRotation() );
>>
>> 2. Say I need to extract images and their metadata from a PDF. What is the
>> best approach to do it?
>>
>> Thanks
>> Dimuthu
>>
>>
>> On Fri, Mar 7, 2014 at 9:26 PM, DImuthu Upeksha
>> <[email protected]> wrote:
>>
>>> Hi John
>>> I refactored the Tesseract JNI code to support a Maven build. To create the JNI
>>> library I added pre-built static libraries of Tesseract and Leptonica to the
>>> resources folder [2]. For now it includes only libraries for Mac, but
>>> we can easily add both Windows and Linux libraries. After "mvn clean
>>> install", the jar is created under the target folder. Now all the setup is
>>> done. What remains is implementing the native methods in tessbaseapi.cpp
>>> [3]. I hope to finish it asap. Please let me know if there are any concerns
>>> about the project structure.
>>>
>>> [1] https://github.com/DImuthuUpe/Tesseract-API.git
>>> [2]
>>> https://github.com/DImuthuUpe/Tesseract-API/tree/master/src/main/resources
>>> [3]
>>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/main/native/src/tessbaseapi.cpp
>>>
>>> Thanks
>>> Dimuthu
>>>
>>>
>>> On Thu, Mar 6, 2014 at 1:15 AM, John Hewson <[email protected]> wrote:
>>>
>>>> Dimuthu
>>>>
>>>>> There are a lot of code
>>>>> fragments in the current Android JNI wrapper which use "(jint)somePointer"
>>>>> casting, which will create terrible memory leaks in 64-bit environments
>>>>> because pointers are 64 bits. So I believe writing it from the beginning
>>>>> is much better.
>>>>
>>>> That's a classic 64-bit pitfall, well spotted. We definitely need to
>>>> support 64-bit JVMs.
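
The truncation can be shown from the Java side, since jint corresponds to Java int (32 bits) and jlong to Java long (64 bits). This is a minimal sketch with a made-up pointer value: NativeHandleDemo and the address are purely illustrative, and no native code is called. It suggests keeping native handles in a long field (jlong on the C++ side) rather than an int.

```java
// Illustrative only: shows why a 64-bit native pointer must not be stored in an int.
final class NativeHandleDemo {
    // Mimics what "(jint)somePointer" does on a 64-bit platform: keeps the low 32 bits.
    static int storeAsInt(long pointer) {
        return (int) pointer;
    }

    // Widening the int back to a long sign-extends; the original high bits are gone.
    static long restoreFromInt(int handle) {
        return handle;
    }

    public static void main(String[] args) {
        long pointer = 0x00007F8A12345678L;   // made-up 64-bit address
        long roundTripped = restoreFromInt(storeAsInt(pointer));
        System.out.println(Long.toHexString(pointer));
        System.out.println(Long.toHexString(roundTripped));
        System.out.println("same handle? " + (pointer == roundTripped));
    }
}
```

Once the upper bits are lost, the value no longer identifies the native object, so any later call that dereferences it is working with an invalid address.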
>>>>
>>>>> we can use
>>>>> the static library of Leptonica (I did and it worked nicely). I think it is
>>>>> not an issue to use its static library because Tesseract is under the
>>>>> Apache licence and Leptonica is under a permissive BSD-style licence.
>>>>
>>>> Sounds good, I found the following in the README:
>>>>
>>>> Leptonica is required. (www.leptonica.com). Tesseract no longer compiles
>>>> without Leptonica.
>>>>
>>>> Which makes sense.
>>>>
>>>> -- John
>>>>
>>>> On 5 Mar 2014, at 09:45, DImuthu Upeksha <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi John,
>>>>> +1 for your suggestion about converting image <=> byte array on the Java
>>>>> side. It reduces a lot of complexity. I don't know whether you have noticed,
>>>>> but the jint data type in JNI is a 32-bit integer type. I noticed it on my
>>>>> Mac but don't know about other operating systems.
>>>>>
>>>>> Leptonica is the image processing library for Tesseract [1]. Tesseract
>>>>> uses the image processing algorithms in Leptonica to implement its OCR
>>>>> algorithms. This [2] is the .cpp file responsible for creating the Tesseract
>>>>> API. You can see it includes the allheaders.h header file, which is the main
>>>>> header file of Leptonica. So I think it is a must to build Leptonica first and
>>>>> link it when we build Tesseract. This is not a big problem if we can use
>>>>> the static library of Leptonica (I did and it worked nicely). I think it is
>>>>> not an issue to use its static library because Tesseract is under the
>>>>> Apache licence and Leptonica is under a permissive BSD-style licence.
>>>>>
>>>>> I'm working on the maven implementation you have mentioned and will get
>>>>> back to you soon.
>>>>>
>>>>> Thanks
>>>>> Dimuthu
>>>>>
>>>>>
>>>>> [1] https://code.google.com/p/tesseract-ocr/wiki/Compiling
>>>>> [2] https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/src/api/tesseractmain.cpp
>>>>>
>>>>>
>>>>> On Wed, Mar 5, 2014 at 1:15 AM, John Hewson <[email protected]> wrote:
>>>>>
>>>>>> Hi Dimuthu,
>>>>>>
>>>>>> 1,2,3:
>>>>>>
>>>>>> Feel free to write your own Tesseract binding or port the existing code as
>>>>>> you see fit.
>>>>>> The JNI binding should be minimal, only the methods you require need to be
>>>>>> wrapped.
>>>>>> Also, don't forget that some of the interop can be done in Java, for
>>>>>> example if it is easier to convert a BufferedImage to a byte array in Java
>>>>>> then do it there and pass the result to JNI rather than writing lots of
>>>>>> JNI C++ to achieve the same result.
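
A minimal sketch of the Java-side conversion suggested above, using only java.awt classes. ImageBytes and toGrayBytes are hypothetical names, and the choice of 8-bit grayscale is an assumption; the wrapper may well want a different pixel layout. The resulting array, together with the width and height, is the kind of data that could be passed through a single JNI call.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.awt.image.DataBufferByte;

// Sketch: flatten any BufferedImage to 8-bit grayscale bytes on the Java side.
final class ImageBytes {
    // Returns width * height bytes, one per pixel, row by row.
    static byte[] toGrayBytes(BufferedImage source) {
        BufferedImage gray = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        try {
            g.drawImage(source, 0, 0, null);   // colour conversion happens here
        } finally {
            g.dispose();
        }
        // Copy so callers do not hold on to the image's internal buffer.
        return ((DataBufferByte) gray.getRaster().getDataBuffer()).getData().clone();
    }

    public static void main(String[] args) {
        BufferedImage page = new BufferedImage(4, 3, BufferedImage.TYPE_INT_RGB);
        byte[] pixels = toGrayBytes(page);
        System.out.println(pixels.length + " bytes for a 4x3 image");
    }
}
```

With one byte per pixel, the bytes-per-line value for the native side is simply the image width, which keeps the C++ part of the binding small.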
>>>>>>
>>>>>> Your GitHub repo looks like a good start, I can make comments there as
>>>>>> things progress.
>>>>>>
>>>>>> Is it possible to build Tesseract without Leptonica? I was under the
>>>>>> impression that it was used for image I/O only, but I may be misinformed.
>>>>>>
>>>>>> 4: The native platform library should be built as part of the Maven build
>>>>>> for the Tesseract wrapper, which can be a separate project. The output can
>>>>>> be a jar file which contains the native binaries. It should be possible for
>>>>>> the jar to contain prebuilt binaries for all platforms, but this is
>>>>>> something we can worry about later. Right now the goal should be to build
>>>>>> a jar containing just the current platform's native binary and any Java
>>>>>> wrapper code.
>>>>>>
>>>>>> -- John
>>>>>>
>>>>>> On 3 Mar 2014, at 16:41, DImuthu Upeksha <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi John,
>>>>>>>
>>>>>>> I tried to reuse that Android JNI wrapper for Tesseract. Here are my
>>>>>>> observations.
>>>>>>>
>>>>>>> 1. This wrapper heavily depends on the Android image libraries
>>>>>>> (android/bitmap.h). Most of the wrapper methods [1] use this library.
>>>>>>>
>>>>>>> 2. But I can understand the underlying logic in each function. Basically
>>>>>>> what it does is map the Tesseract API functions [2] to Java methods.
>>>>>>> In between it does some image <=> byte array conversions by using
>>>>>>> the bitmap libraries in Android.
>>>>>>> 3. There are two ways. 1: We can port its code to make it compatible with
>>>>>>> our environments (Linux, Windows and Mac), which is really painful. Also it
>>>>>>> will cause memory leaks. 2: We can use only its function signatures and
>>>>>>> implement them using our own code.
>>>>>>>
>>>>>>> I think the 2nd solution is better because we need only a few operations to
>>>>>>> be done using the Tesseract library. I have created a GitHub repo [3] for
>>>>>>> this. It's still not finished. I need to add some make files and build
>>>>>>> files to make it run properly. I also need to implement those wrapper
>>>>>>> functions [4]. This may take some time.
>>>>>>>
>>>>>>> 4. Because we are calling native libraries we need different builds of the
>>>>>>> Tesseract and Leptonica libraries for each platform (dll for Windows, so
>>>>>>> for Linux, dylib for Mac). So we may need to build those libraries at the
>>>>>>> time we build the PDFBox project. Or we can pre-build those libraries and
>>>>>>> add them to the project in .dll, .so or .dylib format. What is the
>>>>>>> preferred way?
>>>>>>>
>>>>>>> [1] https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/tessbaseapi.cpp
>>>>>>> [2] https://code.google.com/p/tesseract-ocr/wiki/APIExample
>>>>>>> [3] https://github.com/DImuthuUpe/Tesseract-API
>>>>>>> [4] https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/tessbaseapi.cpp
>>>>>>>
>>>>>>> Thanks
>>>>>>> Dimuthu
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Mar 1, 2014 at 11:39 PM, DImuthu Upeksha <
>>>>>> [email protected]
>>>>>>>> wrote:
>>>>>>>
>>>>>>>> I made the necessary changes to the document [1].
>>>>>>>>
>>>>>>>> For the last two days I had a deep look at this [2] JNI wrapper for the
>>>>>>>> Tesseract API.
>>>>>>>> Unfortunately it has been designed for the Android environment, so I think
>>>>>>>> we need to write our own make files to build it into a dll (Windows) or
>>>>>>>> dylib (Mac). Currently it has Android.mk files [3]. I'm searching for a
>>>>>>>> way to convert them to a make file that we can run on the console. Please
>>>>>>>> suggest a better approach if you have one.
>>>>>>>>
>>>>>>>> [1] https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf
>>>>>>>> [2] https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/
>>>>>>>> [3] https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <[email protected]>
>>>> wrote:
>>>>>>>>
>>>>>>>>> This is a good start. However, there is no need for the Adder component,
>>>>>>>>> "Extracted Text (OCR)" can just feed back into the PDFBox "Text Extractor".
>>>>>>>>>
>>>>>>>>> Maybe show a "PDF" file feeding into "Text Extractor", to make it clear
>>>>>>>>> where the process starts.
>>>>>>>>>
>>>>>>>>> -- John
>>>>>>>>>
>>>>>>>>> On 26 Feb 2014, at 16:53, DImuthu Upeksha <
>>>> [email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Sorry for the mistake. I added it to my Dropbox [1].
>>>>>>>>>>
>>>>>>>>>> [1] https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Dimuthu
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <[email protected]>
>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I should add that the OCR engine should be pluggable, so PDFToText might
>>>>>>>>>>> use an interface, e.g. OCREngine, and there will be a TesseractOCREngine
>>>>>>>>>>> class somewhere which provides the required functionality and lives in
>>>>>>>>>>> a separate jar file.
>>>>>>>>>>>
>>>>>>>>>>> -- John
>>>>>>>>>>>
>>>>>>>>>>>> On 25 Feb 2014, at 20:18, Dimuthu <[email protected]>
>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> So do you need to embed the new functionality into the existing
>>>>>>>>>>>> PDFToText algorithms or package it as a new subsystem (something
>>>>>>>>>>>> like an API)?
>>>>>>>>>>>>
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: "John Hewson" <[email protected]>
>>>>>>>>>>>> Sent: 26/02/2014 07:38
>>>>>>>>>>>> To: "[email protected]" <[email protected]>
>>>>>>>>>>>> Subject: Re: [GSoC 2014]Optical Character Recognition project -
>>>>>>>>>>> Introduction
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, exactly. By location data I just mean (x,y) coordinates and page
>>>>>>>>>>>> rotation.
>>>>>>>>>>>>
>>>>>>>>>>>> There is another use case for OCR: some fonts embedded in PDFs have
>>>>>>>>>>>> corrupt encodings, which means the ASCII codes map to the wrong
>>>>>>>>>>>> glyphs. We could OCR the glyphs to repair the encoding.
>>>>>>>>>>>>
>>>>>>>>>>>> -- John
>>>>>>>>>>>>
>>>>>>>>>>>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <
>>>>>>>>> [email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>> Thanks for the explanation.
>>>>>>>>>>>>> Let's say there is a PDF with both text in an extractable format and
>>>>>>>>>>>>> some images with text (scanned images). In that case we first extract
>>>>>>>>>>>>> the extractable content using the PDFBox algorithms and the rest is
>>>>>>>>>>>>> extracted using OCR. Finally we pack both results together and give the
>>>>>>>>>>>>> output as PDFToText. Am I correct? What do you mean by "location data"?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <
>>>> [email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. What is called "glyphs" ?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> http://en.wikipedia.org/wiki/Glyph
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2. What is the main requirement of this project?
>>>>>>>>>>>>>>> As far as I understood, first we need to generate an image of
>>>>>>>>>>>>>>> malformed PDFs from PDFBox and then we need to process it using OCR
>>>>>>>>>>>>>>> for more accurate results. But the question is, why shouldn't we do
>>>>>>>>>>>>>>> OCR on those PDFs directly, without getting output from PDFBox?
>>>>>>>>>>>>>>> Correct me if I'm wrong.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> PDFBox can generate images (PDFToImage) and can extract text (PDFToText).
>>>>>>>>>>>>>> The goal of this project is to enhance PDFToText so that it can use OCR to
>>>>>>>>>>>>>> extract text from areas of the document where the text is embedded as an
>>>>>>>>>>>>>> image. Such PDF files are typically generated by scanners or fax machines.
>>>>>>>>>>>>>> There is also another case where OCR is useful: some fonts embedded in PDF
>>>>>>>>>>>>>> files contain the wrong encoding, so when text is extracted with PDFToText
>>>>>>>>>>>>>> the result is nonsense but when drawn with PDFToImage we see the correct
>>>>>>>>>>>>>> letters.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Instead of:
>>>>>>>>>>>>>> PDF => Image => OCR => Text
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We want to do:
>>>>>>>>>>>>>> PDF => (Many images for words + location data => OCR) => Text
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <
>>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Ok, fixed. This is what I did:
>>>>>>>>>>>>>>>> Right click on the new project -> Debug As -> Debug Configurations
>>>>>>>>>>>>>>>> -> Source -> Add -> Project
>>>>>>>>>>>>>>>> Then I selected the PDFBox project.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <
>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm using Eclipse. This is what I want. I created a new Java application
>>>>>>>>>>>>>>>>> project (say TestPDFBox) with a main class containing the following code.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> PDDocument document = new PDDocument();
>>>>>>>>>>>>>>>>> PDPage blankPage = new PDPage();
>>>>>>>>>>>>>>>>> document.addPage( blankPage );
>>>>>>>>>>>>>>>>> document.save("BlankPage.pdf");
>>>>>>>>>>>>>>>>> document.close();
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Then I need to add the jar files generated in the target folder of PDFBox
>>>>>>>>>>>>>>>>> to the build path of my new project (I did build the PDFBox project from
>>>>>>>>>>>>>>>>> source). That is what I did. But let's say I need to check the
>>>>>>>>>>>>>>>>> functionality of the document.save("") method. I don't have a reference to
>>>>>>>>>>>>>>>>> its sources because I used the generated jars directly. As Tilman said, I
>>>>>>>>>>>>>>>>> built PDFBox from sources, but I don't know a proper way to use it in other
>>>>>>>>>>>>>>>>> projects other than adding those jar files to the build path.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <
>>>>>> [email protected]>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Which IDE are you using? You should be able to run the PDFToText class
>>>>>>>>>>>>>>>>>> (in pdfbox-tools) using your IDE and pass a PDF file path as the command
>>>>>>>>>>>>>>>>>> line argument.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <
>>>>>>>>>>>>>> [email protected]>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>>> Thanks for the reply. Yes, I checked out the PDFBox code and managed to
>>>>>>>>>>>>>>>>>>> build it successfully. I looked at the classes you mentioned and I got a
>>>>>>>>>>>>>>>>>>> rough idea about how they work. To check them I added the jars in the
>>>>>>>>>>>>>>>>>>> target folder to my separate Java project. I tried the samples in
>>>>>>>>>>>>>>>>>>> http://pdfbox.apache.org/cookbook/. I need to look further into the code,
>>>>>>>>>>>>>>>>>>> especially how those processXXX() methods work in the PDFTextStripper
>>>>>>>>>>>>>>>>>>> class. What I usually do is add some breakpoints and check them in the
>>>>>>>>>>>>>>>>>>> debug windows. But with jars that's not possible. What approach do you
>>>>>>>>>>>>>>>>>>> follow for such a task?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I also installed Tesseract on my machine and managed to do some OCR
>>>>>>>>>>>>>>>>>>> as well. That's a cool tool which works fine.
>>>>>>>>>>>>>>>>>>> I'm still learning the code. If I run into any issue I'll drop you a mail.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <
>>>>>>>>> [email protected]
>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Dimuthu
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/. It contains
>>>>>>>>>>>>>>>>>>>> a basic overview of the project and details on how to obtain the source
>>>>>>>>>>>>>>>>>>>> code and build PDFBox for yourself.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912 details the only
>>>>>>>>>>>>>>>>>>>> thoughts so far regarding it.
>>>>>>>>>>>>>>>>>>>> Note that the OCR libraries mentioned in the JIRA issue are all under the
>>>>>>>>>>>>>>>>>>>> Apache license, which is a requirement.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Once you have the source code, take a look at the PageDrawer class to see
>>>>>>>>>>>>>>>>>>>> how text and images are rendered. We want someone to interface at a low
>>>>>>>>>>>>>>>>>>>> level (e.g. one glyph, word, or sentence at a time) with an OCR engine.
>>>>>>>>>>>>>>>>>>>> Also look at PDFTextStripper, which is how text is currently extracted.
>>>>>>>>>>>>>>>>>>>> Take a look at how we have to go to great lengths to sort text back into
>>>>>>>>>>>>>>>>>>>> reading order and infer the placement of diacritics. PDF is fundamentally
>>>>>>>>>>>>>>>>>>>> a visual format, not a structured format like HTML, which is why
>>>>>>>>>>>>>>>>>>>> extracting text can be so difficult sometimes.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The full PDF Reference document can be found at:
>>>>>>>>>>>>>>>>>>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Feel free to discuss specifics of your proposal or ask any questions.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <
>>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering undergraduate at the
>>>>>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka. I successfully completed my GSoC 2013
>>>>>>>>>>>>>>>>>>>>> with the Apache ISIS [1] project. I'm very much interested in OCR and
>>>>>>>>>>>>>>>>>>>>> image processing, so I would like to select this project idea as my GSoC
>>>>>>>>>>>>>>>>>>>>> 2014 project because I feel it is the best suited project for me. At
>>>>>>>>>>>>>>>>>>>>> university we have also done some research in the OCR area and our group
>>>>>>>>>>>>>>>>>>>>> wrote a literature review about increasing the efficiency of OCR systems
>>>>>>>>>>>>>>>>>>>>> (attached). Can you please suggest where to start learning about PDFBox?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> [1] http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka



-- 
Regards

W.Dimuthu Upeksha
Undergraduate

Department of Computer Science And Engineering

University of Moratuwa, Sri Lanka
