Re: [GSoC 2014]Optical Character Recognition project - Introduction

John Hewson Mon, 10 Mar 2014 02:01:27 -0700

Dimuthu,

> I finished a basic implementation of the JNI wrapper for Tesseract. It can
> now be built using Maven. Some useful methods that are needed for basic OCR
> were implemented.


Great, it’s looking good, nice and clean.

> 1. What is the task of processStream method in PDFTextStripper class line
> 456 : processStream( page.findResources(), content, page.findCropBox(),
> page.findRotation() );

A PDF file is made up of pages, each of which contains a “content stream”. This
content stream contains a list of drawing commands such as “move to 10,15” or
“write the word `foo`”; these are called operators. The processStream function
reads the stream for the current page and executes each of the operators. Each
operator is implemented in its own class, which is a subclass of PDFOperator.
The constructor of PDFStreamEngine creates the operator classes using
reflection, which is rather odd and I’m not sure why this design was chosen.
The operators used by PDFTextStripper are listed in
org/apache/pdfbox/resources/PDFTextStripper.properties
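
As a rough illustration of the dispatch described above, here is a minimal
sketch. It is not PDFBox code: ContentStreamSketch and OperatorHandler are
hypothetical names, the tokens are plain strings, and the handlers are
registered by hand rather than by reflection. Only the operator names "m"
(move to) and "Tj" (show text) come from the PDF format itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of an operator dispatch loop; not the PDFBox API.
class ContentStreamSketch {
    private final Map<String, OperatorHandler> handlers = new HashMap<>();

    ContentStreamSketch() {
        // One handler per operator name, registered by hand for this sketch.
        handlers.put("m", (ops, out) -> out.add("move to " + String.join(",", ops)));
        handlers.put("Tj", (ops, out) -> out.add("show text " + String.join(" ", ops)));
    }

    // Operands come before their operator, so collect tokens until a known
    // operator name appears, then hand the collected operands to its handler.
    // Simplification: any token that is not a registered operator is an operand.
    List<String> run(List<String> tokens) {
        List<String> out = new ArrayList<>();
        List<String> operands = new ArrayList<>();
        for (String token : tokens) {
            OperatorHandler handler = handlers.get(token);
            if (handler == null) {
                operands.add(token);
            } else {
                handler.process(operands, out);
                operands.clear();
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("10", "15", "m", "(foo)", "Tj");
        for (String line : new ContentStreamSketch().run(tokens)) {
            System.out.println(line);
        }
    }
}

// Hypothetical stand-in for a per-operator class.
interface OperatorHandler {
    void process(List<String> operands, List<String> output);
}
```

A text extractor would register only the text-related operators, which is
roughly the role the properties file plays.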

> 2. Say I need to extract images and their metadata from a PDF. What is the
> best approach?

You could subclass PDFTextStripper and override the startDocument method to
create a PDFRenderer and store it in a field. Then override the processPage
method and use that PDFRenderer to render the current page to a BufferedImage
and perform OCR on the image. Once you have the OCR text and positions,
instead of calling processStream you can call processTextPosition once for
each character and its position.
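
The flow above could be sketched roughly as follows. This is only an
illustration under stated assumptions: OcrChar, OcrEngine, CharSink and
OcrPageFlowSketch are hypothetical names, not PDFBox or Tesseract classes.
The rendered page is passed in as a BufferedImage, and CharSink stands in for
the per-character call into the text stripper.

```java
import java.awt.image.BufferedImage;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: run OCR on a rendered page, then forward each character.
class OcrPageFlowSketch {
    // Returns the number of characters forwarded to the sink.
    static int feedPage(BufferedImage renderedPage, OcrEngine engine, CharSink sink) {
        int count = 0;
        for (OcrChar c : engine.recognize(renderedPage)) {
            sink.accept(c); // one call per character and position
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        BufferedImage page = new BufferedImage(8, 8, BufferedImage.TYPE_INT_RGB);
        // Fake engine used only so the sketch runs without a native library.
        OcrEngine fake = image -> Arrays.asList(new OcrChar("f", 10f, 15f));
        feedPage(page, fake, c -> System.out.println(c.text + " at " + c.x + "," + c.y));
    }
}

// Hypothetical value type: one recognised character and its page position.
final class OcrChar {
    final String text;
    final float x;
    final float y;

    OcrChar(String text, float x, float y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

// Hypothetical pluggable OCR interface; a Tesseract-backed class could implement it.
interface OcrEngine {
    List<OcrChar> recognize(BufferedImage pageImage);
}

// Hypothetical callback standing in for the stripper's per-character hook.
interface CharSink {
    void accept(OcrChar c);
}
```

Keeping the engine behind an interface means the native wrapper can be swapped
for a fake in tests.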

The PDFRenderer class was just added to the trunk, so make sure you do an “svn 
update”. Let me know if you need me to change PDFTextStripper to make it easier 
to subclass.

Cheers

-- John

On 9 Mar 2014, at 09:08, DImuthu Upeksha <[email protected]> wrote:

> Hi John,
> I finished a basic implementation of the JNI wrapper for Tesseract. It can
> now be built using Maven. Some useful methods that are needed for basic OCR
> were implemented.
> 
> I went through the PDFBox code several times and have a couple of questions
> that need to be clarified:
> 
> 1. What is the task of processStream method in PDFTextStripper class line
> 456 : processStream( page.findResources(), content, page.findCropBox(),
> page.findRotation() );
> 
> 2. Say I need to extract images and their metadata from a PDF. What is the
> best approach?
> 
> Thanks
> Dimuthu
> 
> 
> On Fri, Mar 7, 2014 at 9:26 PM, DImuthu Upeksha
> <[email protected]>wrote:
> 
>> Hi John
>> I refactored the Tesseract JNI code to support a Maven build. To create the
>> JNI library I added pre-built static libraries of Tesseract and Leptonica to
>> the resources folder [2]. For now it includes only libraries for Mac, but
>> we can easily add both Windows and Linux libraries. After "mvn clean
>> install", the jar is created under the target folder. Now all the setup is
>> done. What remains is implementing the native methods in tessbaseapi.cpp
>> [3]. I hope to finish it asap. Please let me know if you have any concerns
>> about the project structure.
>> 
>> [1] https://github.com/DImuthuUpe/Tesseract-API.git
>> [2] https://github.com/DImuthuUpe/Tesseract-API/tree/master/src/main/resources
>> [3] https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/main/native/src/tessbaseapi.cpp
>> 
>> Thanks
>> Dimuthu
>> 
>> 
>> On Thu, Mar 6, 2014 at 1:15 AM, John Hewson <[email protected]> wrote:
>> 
>>> Dimuthu
>>> 
>>>> There are a lot of code
>>>> fragments in the current Android JNI wrapper which use "(jint)somePointer"
>>>> casting, which will create terrible memory leaks in 64-bit environments
>>>> because pointers are 64 bits wide. So I believe writing it from the
>>>> beginning is much better.
>>> 
>>> That's a classic 64-bit pitfall, well spotted. We definitely need to
>>> support
>>> 64-bit JVMs.
>>> 
>>>> we can use
>>>> the static library of Leptonica (I did and it worked nicely). I think it
>>>> is not an issue to use its static library because both Tesseract and
>>>> Leptonica are under the Apache licence.
>>> 
>>> Sounds good, I found the following in the README:
>>> 
>>> Leptonica is required. (www.leptonica.com). Tesseract no longer compiles
>>> without Leptonica.
>>> 
>>> Which makes sense.
>>> 
>>> -- John
>>> 
>>> On 5 Mar 2014, at 09:45, DImuthu Upeksha <[email protected]>
>>> wrote:
>>> 
>>>> Hi John,
>>>> +1 for your suggestion about converting image <=> byte array on the Java
>>>> side. It removes a lot of complexity. I don't know whether you have
>>>> noticed, but the jint data type in JNI is a 32-bit integer type. I noticed
>>>> it on my Mac but don't know about other operating systems.
>>>> 
>>>> Leptonica is the image processing library for Tesseract [1]. Tesseract
>>>> uses the image processing algorithms in Leptonica to implement its OCR
>>>> algorithms. This [2] is the .cpp file responsible for creating the
>>>> Tesseract API. You can see that it includes allheaders.h, which is the
>>>> main header file of Leptonica. So I think we must build Leptonica first
>>>> and link it when we build Tesseract. This is not a big problem if we can
>>>> use the static library of Leptonica (I did and it worked nicely). I think
>>>> it is not an issue to use its static library because both Tesseract and
>>>> Leptonica are under the Apache licence.
>>>> 
>>>> I'm working on the maven implementation you have mentioned and will get
>>>> back to you soon.
>>>> 
>>>> Thanks
>>>> Dimuthu
>>>> 
>>>> 
>>>> [1] https://code.google.com/p/tesseract-ocr/wiki/Compiling
>>>> [2] https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/src/api/tesseractmain.cpp
>>>> 
>>>> 
>>>> On Wed, Mar 5, 2014 at 1:15 AM, John Hewson <[email protected]> wrote:
>>>> 
>>>>> Hi Dimuthu,
>>>>> 
>>>>> 1,2,3:
>>>>> 
>>>>> Feel free to write your own Tesseract binding or port the existing code
>>>>> as you see fit. The JNI binding should be minimal; only the methods you
>>>>> require need to be wrapped. Also, don't forget that some of the interop
>>>>> can be done in Java. For example, if it is easier to convert a
>>>>> BufferedImage to a byte array in Java, then do it there and pass the
>>>>> result to JNI rather than writing lots of JNI C++ to achieve the same
>>>>> result.
>>>>> 
>>>>> Your GitHub repo looks like a good start, I can make comments there as
>>>>> things progress.
>>>>> 
>>>>> Is it possible to build Tesseract without Leptonica? I was under the
>>>>> impression that it was used for image I/O only, but I may be misinformed.
>>>>> 
>>>>> 4: The native platform library should be built as part of the Maven build
>>>>> for the Tesseract wrapper, which can be a separate project. The output
>>>>> can be a jar file which contains the native binaries. It should be
>>>>> possible for the jar to contain prebuilt binaries for all platforms, but
>>>>> this is something we can worry about later. Right now the goal should be
>>>>> to build a jar containing just the current platform's native binary and
>>>>> any Java wrapper code.
>>>>> 
>>>>> -- John
>>>>> 
>>>>> On 3 Mar 2014, at 16:41, DImuthu Upeksha <[email protected]>
>>>>> wrote:
>>>>> 
>>>>>> Hi John,
>>>>>> 
>>>>>> I tried to reuse the Android JNI wrapper for Tesseract. Here are my
>>>>>> observations:
>>>>>> 
>>>>>> 1. This wrapper depends heavily on Android image libraries
>>>>>> (android/bitmap.h). Most of the wrapper methods [1] use this library.
>>>>>> 
>>>>>> 2. But I can understand the underlying logic in each function. Basically,
>>>>>> it maps Tesseract API functions [2] to Java methods. In between, it does
>>>>>> some image <=> byte array conversions using the Android bitmap
>>>>>> libraries.
>>>>>> 
>>>>>> 3. There are two ways forward. 1: We can port its code to make it
>>>>>> compatible with our environments (Linux, Windows and Mac), which is
>>>>>> really painful. It will also cause memory leaks. 2: We can use only its
>>>>>> function signatures and implement them with our own code.
>>>>>> 
>>>>>> I think the 2nd solution is better because we need only a few operations
>>>>>> to be done using the Tesseract library. I have created a GitHub repo [3]
>>>>>> for this. It's still not finished. I need to add some makefiles and build
>>>>>> files to make it run properly. I also need to implement those wrapper
>>>>>> functions [4]. This may take some time.
>>>>>> 
>>>>>> 4. Because we are calling native libraries, we need different builds of
>>>>>> the Tesseract and Leptonica libraries for each platform (dll for Windows,
>>>>>> so for Linux, dylib for Mac). So we may need to build those libraries
>>>>>> when we build the PDFBox project. Or we can pre-build those libraries and
>>>>>> add them to the project in .dll, .so or .dylib format. What is the
>>>>>> preferred way?
>>>>>> 
>>>>>> [1] https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/tessbaseapi.cpp
>>>>>> [2] https://code.google.com/p/tesseract-ocr/wiki/APIExample
>>>>>> [3] https://github.com/DImuthuUpe/Tesseract-API
>>>>>> [4] https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/tessbaseapi.cpp
>>>>>> 
>>>>>> Thanks
>>>>>> Dimuthu
>>>>>> 
>>>>>> 
>>>>>> On Sat, Mar 1, 2014 at 11:39 PM, DImuthu Upeksha <
>>>>> [email protected]
>>>>>>> wrote:
>>>>>> 
>>>>>>> I made the necessary changes to the document [1].
>>>>>>> 
>>>>>>> For the last two days I have had a deep look at this [2] JNI wrapper for
>>>>>>> the Tesseract API.
>>>>>>> Unfortunately it has been designed for the Android environment, so I
>>>>>>> think we need to write our own makefiles to build it into a dll (Windows)
>>>>>>> or dylib (Mac). Currently it has Android.mk files [3]. I'm searching for
>>>>>>> a way to convert them to a makefile that we can run from the console.
>>>>>>> Please suggest a better approach if you have one.
>>>>>>> 
>>>>>>> [1] https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf
>>>>>>> [2] https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/
>>>>>>> [3] https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk
>>>>>>> 
>>>>>>> 
>>>>>>> On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <[email protected]>
>>> wrote:
>>>>>>> 
>>>>>>>> This is a good start. However, there is no need for the Adder component;
>>>>>>>> "Extracted Text (OCR)" can just feed back into the PDFBox "Text
>>>>>>>> Extractor".
>>>>>>>> 
>>>>>>>> Maybe show a "PDF" file feeding into "Text Extractor", to make it clear
>>>>>>>> where the process starts.
>>>>>>>> 
>>>>>>>> -- John
>>>>>>>> 
>>>>>>>> On 26 Feb 2014, at 16:53, DImuthu Upeksha <
>>> [email protected]>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Sorry for the mistake. I added it to my Dropbox [1].
>>>>>>>>> 
>>>>>>>>> [1] https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
>>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> Dimuthu
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <[email protected]>
>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> I should add that the OCR engine should be pluggable, so PDFToText might
>>>>>>>>>> use an interface, e.g. OCREngine, and there will be a TesseractOCREngine
>>>>>>>>>> class somewhere which provides the required functionality and lives in a
>>>>>>>>>> separate jar file.
>>>>>>>>>> 
>>>>>>>>>> -- John
>>>>>>>>>> 
>>>>>>>>>>> On 25 Feb 2014, at 20:18, Dimuthu <[email protected]>
>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> So do you need to embed those new functionalities into existing
>>>>>>>>>> PDFtoText algorithms or package them as a new sub system(something
>>>>>>>> like an
>>>>>>>>>> API)?
>>>>>>>>>>> 
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: "John Hewson" <[email protected]>
>>>>>>>>>>> Sent: 26/02/2014 07:38
>>>>>>>>>>> To: "[email protected]" <[email protected]>
>>>>>>>>>>> Subject: Re: [GSoC 2014]Optical Character Recognition project -
>>>>>>>>>> Introduction
>>>>>>>>>>> 
>>>>>>>>>>> Yes, exactly. By location data I just mean (x,y) coordinates and
>>>>> page
>>>>>>>>>> rotation.
>>>>>>>>>>> 
>>>>>>>>>>> There is another use case for OCR: some fonts embedded in PDFs have
>>>>>>>>>>> corrupt encodings, which means the ASCII codes map to the wrong glyphs.
>>>>>>>>>>> We could OCR the glyphs to repair the encoding.
>>>>>>>>>>> 
>>>>>>>>>>> -- John
>>>>>>>>>>> 
>>>>>>>>>>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <
>>>>>>>> [email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi John,
>>>>>>>>>>>> Thanks for the explanation.
>>>>>>>>>>>> Let's say there is a pdf with both text in extractable format
>>> and
>>>>>>>> some
>>>>>>>>>>>> images with text(Scanned images). In that case first we extract
>>>>> those
>>>>>>>>>>>> extractable content using PDFBox algorithms and rest is
>>> extracted
>>>>>>>> using
>>>>>>>>>>>> OCR. Finally we pack both results together and give output as
>>>>>>>>>> PDFToText. Am
>>>>>>>>>>>> I correct? What do you mean by "location data"?
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <
>>> [email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 1. What is called "glyphs" ?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> http://en.wikipedia.org/wiki/Glyph
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 2. What is the main requirement of this project?
>>>>>>>>>>>>>> As far as I understood, first we need to generate an image of
>>>>>>>>>>>>>> malformed pdfs from
>>>>>>>>>>>>>> PDFBox and then we need to do processing using OCR for further
>>>>>>>>>> accurate
>>>>>>>>>>>>>> results.  But the problem is, why shouldn't we directly do
>>> OCR on
>>>>>>>>>> those
>>>>>>>>>>>>>> PDFs without getting output from PDFBox? Correct me if I'm
>>> wrong.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> PDFBox can generate images (PDFToImage) and can extract text
>>>>>>>>>> (PDFToText).
>>>>>>>>>>>>> The goal of
>>>>>>>>>>>>> this project is to enhance PDFToText so that it can use OCR to
>>>>>>>> extract
>>>>>>>>>>>>> text from areas of the
>>>>>>>>>>>>> document where the text is embedded as an image. Such PDF files
>>>>> are
>>>>>>>>>>>>> typically generated by
>>>>>>>>>>>>> scanners or fax machines. There is also another case where OCR
>>> is
>>>>>>>>>> useful:
>>>>>>>>>>>>> some fonts embedded
>>>>>>>>>>>>> in PDF files contain the wrong encoding, so when text is
>>> extracted
>>>>>>>> with
>>>>>>>>>>>>> PDFToText the result is
>>>>>>>>>>>>> nonsense but when drawn with PDFToImage we see the correct
>>>>> letters.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Instead of:
>>>>>>>>>>>>> PDF => Image => OCR => Text
>>>>>>>>>>>>> 
>>>>>>>>>>>>> We want to do:
>>>>>>>>>>>>> PDF => (Many images for words + location data => OCR) => Text
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -- John
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <
>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Ok fixed. This is what I did
>>>>>>>>>>>>>>> Right click on the new project ->Debug As-> Debug
>>> Configurations
>>>>>>>>>>>>> ->Source
>>>>>>>>>>>>>>> ->Add -> Project
>>>>>>>>>>>>>>> Then I selected PDFBox project.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I'm using eclipse. This is what I want. I created a new Java
>>>>>>>>>>>>> application
>>>>>>>>>>>>>>>> project (say TestPDFBox) with a main class with following
>>> code.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> PDDocument document = new PDDocument();
>>>>>>>>>>>>>>>> PDPage blankPage = new PDPage();
>>>>>>>>>>>>>>>> document.addPage( blankPage );
>>>>>>>>>>>>>>>> document.save("BlankPage.pdf");
>>>>>>>>>>>>>>>> document.close();
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Then I need to add those jar files generated in target
>>> folder
>>>>> of
>>>>>>>>>> PDFBox
>>>>>>>>>>>>>>>> to build path of my new project (I did build the PDFBox
>>> project
>>>>>>>> from
>>>>>>>>>>>>>>>> source). That is what I did. But let's say I need to check
>>> the
>>>>>>>>>>>>>>>> functionality of document.save("") method. But I don't have
>>> a
>>>>>>>>>>>>> reference to
>>>>>>>>>>>>>>>> it's sources because I directly used generated jars. As
>>> Tilman
>>>>>>>> said
>>>>>>>>>> I
>>>>>>>>>>>>> built
>>>>>>>>>>>>>>>> PDFBox from sources but I don't know a proper way to use it
>>>>> other
>>>>>>>>>>>>> projects
>>>>>>>>>>>>>>>> other than adding those jar files to build path.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <
>>>>> [email protected]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Which IDE are you using? You should be able to run the
>>>>> PDFToText
>>>>>>>>>> class
>>>>>>>>>>>>>>>>> (in pdfbox-tools) using your IDE and pass a PDF file path
>>> as
>>>>> the
>>>>>>>>>>>>> command
>>>>>>>>>>>>>>>>> line argument.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <
>>>>>>>>>>>>> [email protected]>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>> Thanks for the reply. Yes, I checked out the PDFBox code and managed to
>>>>>>>>>>>>>>>>>> build it successfully. I looked at the classes you mentioned and got a
>>>>>>>>>>>>>>>>>> rough idea of how they work. To check them I added the jars from the
>>>>>>>>>>>>>>>>>> target folder to a separate Java project. I tried the samples in
>>>>>>>>>>>>>>>>>> http://pdfbox.apache.org/cookbook/. I need to look further into the code,
>>>>>>>>>>>>>>>>>> especially how the processXXX() methods work in the PDFTextStripper class.
>>>>>>>>>>>>>>>>>> What I usually do is add some breakpoints and inspect them in the debug
>>>>>>>>>>>>>>>>>> windows. But that's not possible when using jars. What approach do you
>>>>>>>>>>>>>>>>>> follow for such a task?
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> As well I installed tesseract in to my machine and
>>> managed to
>>>>>>>> do
>>>>>>>>>> some
>>>>>>>>>>>>>>>>> OCR
>>>>>>>>>>>>>>>>>> stuff also. That's a cool tool which works fine.
>>>>>>>>>>>>>>>>>> I'm still learning the code. If I get any issue I'll drop
>>>>> you a
>>>>>>>>>> mail.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <
>>>>>>>> [email protected]
>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Hi Dimuthu
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/. It contains
>>>>>>>>>>>>>>>>>>> a basic overview of the project and details on how to obtain the source
>>>>>>>>>>>>>>>>>>> code and build PDFBox for yourself.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912
>>> details
>>>>>>>> the
>>>>>>>>>> only
>>>>>>>>>>>>>>>>>>> thoughts so far regarding it.
>>>>>>>>>>>>>>>>>>> Note that the OCR libraries mentioned in the JIRA issue
>>> are
>>>>>>>> all
>>>>>>>>>>>>> under
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> Apache license, which is a
>>>>>>>>>>>>>>>>>>> requirement.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Once you have the source code, take a look at the
>>> PageDrawer
>>>>>>>>>> class
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> see
>>>>>>>>>>>>>>>>>>> how text and images are
>>>>>>>>>>>>>>>>>>> rendered. We want someone to interface at a low-level
>>> (e.g.
>>>>>>>> one
>>>>>>>>>>>>> glyph,
>>>>>>>>>>>>>>>>>>> word, or sentence at a time) with
>>>>>>>>>>>>>>>>>>> an OCR engine. Also look at PDFTextStripper which is how
>>>>> text
>>>>>>>> is
>>>>>>>>>>>>>>>>> currently
>>>>>>>>>>>>>>>>>>> extracted, take a look at how
>>>>>>>>>>>>>>>>>>> we have to go to great length to sort text back into
>>> reading
>>>>>>>>>> order
>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>> infer the placement of diacritics - PDF
>>>>>>>>>>>>>>>>>>> is fundamentally a visual format, not a structured format
>>>>> like
>>>>>>>>>> HTML
>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>> which is why extracting text can be so
>>>>>>>>>>>>>>>>>>> difficult sometimes.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> The full PDF Reference document can be found at:
>>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>> 
>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Feel free to discuss specifics of your proposal or ask
>>> any
>>>>>>>>>>>>> questions.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <
>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering
>>> Undergraduate
>>>>> at
>>>>>>>>>>>>>>>>> University
>>>>>>>>>>>>>>>>>>> of Moratuwa Sri Lanka. I successfully completed my GSoC
>>> 2013
>>>>>>>> with
>>>>>>>>>>>>>>>>> Apache
>>>>>>>>>>>>>>>>>>> ISIS [1] project. I'm very much interested in OCR and
>>> image
>>>>>>>>>>>>> processing
>>>>>>>>>>>>>>>>>>> stuff. So I would like to select this project idea as my
>>>>> GSoC
>>>>>>>>>> 2014
>>>>>>>>>>>>>>>>> project
>>>>>>>>>>>>>>>>>>> because I feel like it is the best suited project for
>>> me. In
>>>>>>>>>>>>>>>>> university
>>>>>>>>>>>>>>>>>>> also we have done some research in OCR area and our group
>>>>>>>> wrote a
>>>>>>>>>>>>>>>>>>> literature review about increasing efficiency of OCR
>>>>>>>>>>>>>>>>> systems(attached). Can
>>>>>>>>>>>>>>>>>>> you please suggest me where to start learning about
>>> PDFBox?
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>> 
>>> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> --
>>>>>>>>>>>> Regards
>>>>>>>>>>>> 
>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>> 
>>>>>>>>>>>> University of Moratuwa, Sri Lanka
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Regards
>>>>>>>>> 
>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>> Undergraduate
>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>> 
>>>>>>>>> University of Moratuwa, Sri Lanka
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> Regards
>>>>>>> 
>>>>>>> W.Dimuthu Upeksha
>>>>>>> Undergraduate
>>>>>>> Department of Computer Science And Engineering
>>>>>>> 
>>>>>>> University of Moratuwa, Sri Lanka
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Regards
>>>>>> 
>>>>>> W.Dimuthu Upeksha
>>>>>> Undergraduate
>>>>>> Department of Computer Science And Engineering
>>>>>> 
>>>>>> University of Moratuwa, Sri Lanka
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Regards
>>>> 
>>>> W.Dimuthu Upeksha
>>>> Undergraduate
>>>> Department of Computer Science And Engineering
>>>> 
>>>> University of Moratuwa, Sri Lanka
>>> 
>>> 
>> 
>> 
>> --
>> Regards
>> 
>> W.Dimuthu Upeksha
>> Undergraduate
>> Department of Computer Science And Engineering
>> 
>> University of Moratuwa, Sri Lanka
>> 
> 
> 
> 
> -- 
> Regards
> 
> W.Dimuthu Upeksha
> Undergraduate
> Department of Computer Science And Engineering
> 
> University of Moratuwa, Sri Lanka
