> You just need to use the following Apache header on your Java source files:
Actually, no, forget that. I don’t think you can use that header yet as you haven’t signed a CLA. Leave the files as they are without headers for now. We’ll deal with the licensing later because your code isn’t in the official Apache repository yet.

-- John

On 10 Mar 2014, at 01:30, John Hewson <j...@jahewson.com> wrote:

> Dimuthu,
>
> That’s looking really good. You just need to use the following Apache header
> on your Java source files:
>
> /*
>  * Licensed to the Apache Software Foundation (ASF) under one or more
>  * contributor license agreements. See the NOTICE file distributed with
>  * this work for additional information regarding copyright ownership.
>  * The ASF licenses this file to You under the Apache License, Version 2.0
>  * (the "License"); you may not use this file except in compliance with
>  * the License. You may obtain a copy of the License at
>  *
>  * http://www.apache.org/licenses/LICENSE-2.0
>  *
>  * Unless required by applicable law or agreed to in writing, software
>  * distributed under the License is distributed on an "AS IS" BASIS,
>  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>  * See the License for the specific language governing permissions and
>  * limitations under the License.
>  */
>
> -- John
>
> On 7 Mar 2014, at 07:56, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>
>> Hi John
>> I refactored Tesseract JNI code to support maven build. To create the JNI
>> library I added pre-built static libraries of Tesseract and Leptonica to
>> resources folder[2]. For now it includes libraries supported for mac. But
>> we can easily add both windows and linux libraries. After "mvn clean
>> install", the jar is created under target folder. Now all setting up is
>> done. What remains is implementing those native methods in tessbaseapi.cpp
>> [3]. Hope to finish it asap. Please let me know if there is any concern
>> about project structure.
>>
>> [1] https://github.com/DImuthuUpe/Tesseract-API.git
>> [2] https://github.com/DImuthuUpe/Tesseract-API/tree/master/src/main/resources
>> [3] https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/main/native/src/tessbaseapi.cpp
>>
>> Thanks
>> Dimuthu
>>
>> On Thu, Mar 6, 2014 at 1:15 AM, John Hewson <j...@jahewson.com> wrote:
>>
>>> Dimuthu
>>>
>>>> There is a lot of code
>>>> fractions in current android jni wrapper which use "(jint)somePointer"
>>>> casting which will create terrible memory leaks in 64 bit environments
>>>> because pointers are 64 bit. So I believe writing it from the beginning is
>>>> much better.
>>>
>>> That's a classic 64-bit pitfall, well spotted. We definitely need to
>>> support 64-bit JVMs.
>>>
>>>> we can use
>>>> the static library of Leptonica (I did and it worked nicely). I think it is
>>>> not a issue to use it's static library because both Tesseract and Leptonica
>>>> is under apache licence.
>>>
>>> Sounds good, I found the following in the README:
>>>
>>> Leptonica is required. (www.leptonica.com). Tesseract no longer compiles
>>> without Leptonica.
>>>
>>> Which makes sense.
>>>
>>> -- John
>>>
>>> On 5 Mar 2014, at 09:45, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>>>
>>>> Hi John,
>>>> +1 for you suggestion about converting image <=> byte array at java side.
>>>> It reduces lot of complexities. I don't know whether you have noticed or
>>>> not, jint data type in jni is a 32bit integer type. I noticed it in my Mac
>>>> but don't know about other operating systems.
>>>>
>>>> Leptonica is the image processing library for Tesseract [1]. What tesseract
>>>> do is using image processing algorithms in Leptonica to implement its OCR
>>>> algorithms. This [2] is the responsible .cpp file to create Tesseract API.
>>>> You can see it includes allheaders.h header file which is the main header
>>>> file of Leptonica.
>>>> So I think it is a must to build Leptonica first and
>>>> link it when we build Tesseract. This is not a big problem if we can use
>>>> the static library of Leptonica (I did and it worked nicely). I think it is
>>>> not a issue to use it's static library because both Tesseract and Leptonica
>>>> is under apache licence.
>>>>
>>>> I'm working on the maven implementation you have mentioned and will get
>>>> back to you soon.
>>>>
>>>> Thanks
>>>> Dimuthu
>>>>
>>>> [1] https://code.google.com/p/tesseract-ocr/wiki/Compiling
>>>> [2] https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/src/api/tesseractmain.cpp
>>>>
>>>> On Wed, Mar 5, 2014 at 1:15 AM, John Hewson <j...@jahewson.com> wrote:
>>>>
>>>>> Hi Dimuthu,
>>>>>
>>>>> 1,2,3:
>>>>>
>>>>> Feel free to write your own Tesseract binding or port the existing code as
>>>>> you see fit.
>>>>> The JNI binding should be minimal, only the methods you require need to be
>>>>> wrapped.
>>>>> Also, don't forget that some of the interop can be done in Java, for
>>>>> example if it is easier to convert a BufferedImage to a byte array in Java
>>>>> then do it there and pass the result to JNI rather than writing lots of
>>>>> JNI C++ to achieve the same result.
>>>>>
>>>>> Your GitHub repo looks like a good start, I can make comments there as
>>>>> things progress.
>>>>>
>>>>> Is it possible to build Tesseract without leptonica? I was under the
>>>>> impression that it was used for image i/o only, but I may be misinformed.
>>>>>
>>>>> 4: The native platform library should be built as part of the Maven build
>>>>> for the Tesseract wrapper which can be a separate project. The output can
>>>>> be a jar file which contains the native binaries. It should be possible
>>>>> for the jar to contain prebuilt binaries for all platforms
>>>>> but this is something we can worry about later.
>>>>> Right now the goal should be to build a jar
>>>>> containing just the current platform's native binary and any Java wrapper
>>>>> code.
>>>>>
>>>>> -- John
>>>>>
>>>>> On 3 Mar 2014, at 16:41, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>>>>>
>>>>>> Hi John,
>>>>>>
>>>>>> I tried to reuse that android jni wrapper for tesseract. Here is my
>>>>>> observation
>>>>>>
>>>>>> 1. This wrapper heavily depends on android image libraries.
>>>>>> (android/bitmap.h). Most of the wrapper methods [1] use this library.
>>>>>>
>>>>>> 2. But I can understand underlying logic in each function. Basically what
>>>>>> it does is mapping between tesseract api functions [2] with java methods.
>>>>>> In between it does to some image <=> byte array like conversions by using
>>>>>> that bitmap libraries in Android
>>>>>>
>>>>>> 3. There are two ways. 1: We can port it's code to make compatible with our
>>>>>> environments(linux,windows and mac) which is really painful. Also it will
>>>>>> cause memory leaks. 2: We can use only it's function signatures and
>>>>>> implement using our codes
>>>>>>
>>>>>> I think 2nd solution is better because we need only few operations to be
>>>>>> done using tesseract library. I have created a github repo [3] for this.
>>>>>> It's still not finished. I need to add some make files and build files to
>>>>>> make it run properly. And also I need to implement those wrapper functions
>>>>>> [4]. This may take some time.
>>>>>>
>>>>>> 4. Because we are calling native libraries we need different builds of
>>>>>> tesseract and leptonica libraries for each platform (dll for windows, so
>>>>>> for linux, dylib for mac). So we may need to build those libraries at the
>>>>>> time we build pdfbox project. Or we can pre build those libraries and add
>>>>>> them to the project as .dll, .so or .dylib format. What is the preferred
>>>>>> way?
>>>>>>
>>>>>> [1] https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/tessbaseapi.cpp
>>>>>> [2] https://code.google.com/p/tesseract-ocr/wiki/APIExample
>>>>>> [3] https://github.com/DImuthuUpe/Tesseract-API
>>>>>> [4] https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/tessbaseapi.cpp
>>>>>>
>>>>>> Thanks
>>>>>> Dimuthu
>>>>>>
>>>>>> On Sat, Mar 1, 2014 at 11:39 PM, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>>>>>>
>>>>>>> I updated necessary changes to the document [1]
>>>>>>>
>>>>>>> For last two days I had a deep look at this [2] jni wrapper for tesseract
>>>>>>> api.
>>>>>>> Unfortunately this has been designed for Android environment so I think we
>>>>>>> need to write our own make files to build this in to a dll(windows) or
>>>>>>> dylib(in mac). Currently it has Android.mk files [3]. I'm searching for a
>>>>>>> way to convert it to a make file that we can run on console. Please suggest
>>>>>>> if you have a better approach
>>>>>>>
>>>>>>> [1] https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf
>>>>>>> [2] https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/
>>>>>>> [3] https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk
>>>>>>>
>>>>>>> On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <j...@jahewson.com> wrote:
>>>>>>>
>>>>>>>> This is a good start. However, there is no need for the Adder component,
>>>>>>>> "Extracted Text (OCR)" can just feed back into the PDFBox "Text Extractor".
>>>>>>>>
>>>>>>>> Maybe show a "PDF" file feeding in to "Text Extractor", to make it clear
>>>>>>>> where the process starts.
>>>>>>>>
>>>>>>>> -- John
>>>>>>>>
>>>>>>>> On 26 Feb 2014, at 16:53, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Sorry for the mistake. I added it to my Dropbox [1].
>>>>>>>>>
>>>>>>>>> [1] https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Dimuthu
>>>>>>>>>
>>>>>>>>> On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <j...@jahewson.com> wrote:
>>>>>>>>>
>>>>>>>>>> I should add that the OCR engine should be pluggable so PDFToText might
>>>>>>>>>> use an interface, e.g. OCREngine and there will be a TesseractOCREngine
>>>>>>>>>> class somewhere which provides the required functionality and lives in a
>>>>>>>>>> separate jar file.
>>>>>>>>>>
>>>>>>>>>> -- John
>>>>>>>>>>
>>>>>>>>>>> On 25 Feb 2014, at 20:18, Dimuthu <dimuthu.upeks...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> So do you need to embed those new functionalities into existing
>>>>>>>>>>> PDFtoText algorithms or package them as a new sub system(something
>>>>>>>>>>> like an API)?
>>>>>>>>>>>
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: "John Hewson" <j...@jahewson.com>
>>>>>>>>>>> Sent: 26/02/2014 07:38
>>>>>>>>>>> To: "dev@pdfbox.apache.org" <dev@pdfbox.apache.org>
>>>>>>>>>>> Subject: Re: [GSoC 2014]Optical Character Recognition project - Introduction
>>>>>>>>>>>
>>>>>>>>>>> Yes, exactly. By location data I just mean (x,y) coordinates and page
>>>>>>>>>>> rotation.
>>>>>>>>>>>
>>>>>>>>>>> There is another use case for OCR: some fonts embedded in PDFs have
>>>>>>>>>>> corrupt encodings, which means the ASCII codes map to the wrong
>>>>>>>>>>> glyphs. We could OCR the glyphs to repair the encoding.
>>>>>>>>>>>
>>>>>>>>>>> -- John
>>>>>>>>>>>
>>>>>>>>>>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi John,
>>>>>>>>>>>> Thanks for the explanation.
>>>>>>>>>>>> Let's say there is a pdf with both text in extractable format and some
>>>>>>>>>>>> images with text(Scanned images). In that case first we extract those
>>>>>>>>>>>> extractable content using PDFBox algorithms and rest is extracted using
>>>>>>>>>>>> OCR. Finally we pack both results together and give output as PDFToText. Am
>>>>>>>>>>>> I correct? What do you mean by "location data"?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <j...@jahewson.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. What is called "glyphs" ?
>>>>>>>>>>>>>
>>>>>>>>>>>>> http://en.wikipedia.org/wiki/Glyph
>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2. What is the main requirement of this project?
>>>>>>>>>>>>>> As far as I understood, first we need to generate an image of
>>>>>>>>>>>>>> malformed pdfs from PDFBox and then we need to do processing using
>>>>>>>>>>>>>> OCR for further accurate results. But the problem is, why shouldn't
>>>>>>>>>>>>>> we directly do OCR on those PDFs without getting output from PDFBox?
>>>>>>>>>>>>>> Correct me if I'm wrong.
>>>>>>>>>>>>>
>>>>>>>>>>>>> PDFBox can generate images (PDFToImage) and can extract text (PDFToText).
>>>>>>>>>>>>> The goal of this project is to enhance PDFToText so that it can use OCR
>>>>>>>>>>>>> to extract text from areas of the document where the text is embedded as
>>>>>>>>>>>>> an image. Such PDF files are typically generated by
>>>>>>>>>>>>> scanners or fax machines.
>>>>>>>>>>>>> There is also another case where OCR is useful:
>>>>>>>>>>>>> some fonts embedded in PDF files contain the wrong encoding, so when
>>>>>>>>>>>>> text is extracted with PDFToText the result is nonsense but when drawn
>>>>>>>>>>>>> with PDFToImage we see the correct letters.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Instead of:
>>>>>>>>>>>>> PDF => Image => OCR => Text
>>>>>>>>>>>>>
>>>>>>>>>>>>> We want to do:
>>>>>>>>>>>>> PDF => (Many images for words + location data => OCR) => Text
>>>>>>>>>>>>>
>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ok fixed. This is what I did
>>>>>>>>>>>>>>> Right click on the new project -> Debug As -> Debug Configurations
>>>>>>>>>>>>>>> -> Source -> Add -> Project
>>>>>>>>>>>>>>> Then I selected PDFBox project.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm using eclipse. This is what I want. I created a new Java application
>>>>>>>>>>>>>>>> project (say TestPDFBox) with a main class with following code.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> PDDocument document = new PDDocument();
>>>>>>>>>>>>>>>> PDPage blankPage = new PDPage();
>>>>>>>>>>>>>>>> document.addPage( blankPage );
>>>>>>>>>>>>>>>> document.save("BlankPage.pdf");
>>>>>>>>>>>>>>>> document.close();
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Then I need to add those jar files generated in target folder of PDFBox
>>>>>>>>>>>>>>>> to build path of my new project (I did build the PDFBox project from
>>>>>>>>>>>>>>>> source). That is what I did. But let's say I need to check the
>>>>>>>>>>>>>>>> functionality of document.save("") method.
>>>>>>>>>>>>>>>> But I don't have a reference to
>>>>>>>>>>>>>>>> it's sources because I directly used generated jars. As Tilman said I
>>>>>>>>>>>>>>>> built PDFBox from sources but I don't know a proper way to use it other
>>>>>>>>>>>>>>>> projects other than adding those jar files to build path.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <j...@jahewson.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Which IDE are you using? You should be able to run the PDFToText class
>>>>>>>>>>>>>>>>> (in pdfbox-tools) using your IDE and pass a PDF file path as the command
>>>>>>>>>>>>>>>>> line argument.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>> Thanks for the reply. Yes I checked out PDFBox code and managed to build
>>>>>>>>>>>>>>>>>> code successfully. I looked at the classes you mentioned and I got a rough
>>>>>>>>>>>>>>>>>> idea about how they are working. To check them I used the jars in target
>>>>>>>>>>>>>>>>>> folder to my separate java project. I tried samples in
>>>>>>>>>>>>>>>>>> http://pdfbox.apache.org/cookbook/. I need to further look into code
>>>>>>>>>>>>>>>>>> specially how those processXXX() methods work in PDFTextStripper class.
>>>>>>>>>>>>>>>>>> What I usually do is adding some breakpoints and checking them in debug
>>>>>>>>>>>>>>>>>> windows. But using jars it's not possible. What is the way you follow in
>>>>>>>>>>>>>>>>>> order to do such task?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> As well I installed tesseract in to my machine and managed to do some OCR
>>>>>>>>>>>>>>>>>> stuff also. That's a cool tool which works fine.
>>>>>>>>>>>>>>>>>> I'm still learning the code. If I get any issue I'll drop you a mail.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <j...@jahewson.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Dimuthu
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/; it contains
>>>>>>>>>>>>>>>>>>> a basic overview of the project and details on how to obtain the source
>>>>>>>>>>>>>>>>>>> code and build PDFBox for yourself.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912 details the only
>>>>>>>>>>>>>>>>>>> thoughts so far regarding it.
>>>>>>>>>>>>>>>>>>> Note that the OCR libraries mentioned in the JIRA issue are all under the
>>>>>>>>>>>>>>>>>>> Apache license, which is a requirement.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Once you have the source code, take a look at the PageDrawer class to see
>>>>>>>>>>>>>>>>>>> how text and images are rendered. We want someone to interface at a
>>>>>>>>>>>>>>>>>>> low-level (e.g. one glyph, word, or sentence at a time) with
>>>>>>>>>>>>>>>>>>> an OCR engine.
>>>>>>>>>>>>>>>>>>> Also look at PDFTextStripper which is how text is currently
>>>>>>>>>>>>>>>>>>> extracted, take a look at how we have to go to great length to sort text
>>>>>>>>>>>>>>>>>>> back into reading order and infer the placement of diacritics - PDF
>>>>>>>>>>>>>>>>>>> is fundamentally a visual format, not a structured format like HTML -
>>>>>>>>>>>>>>>>>>> which is why extracting text can be so difficult sometimes.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The full PDF Reference document can be found at:
>>>>>>>>>>>>>>>>>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Feel free to discuss specifics of your proposal or ask any questions.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering Undergraduate at University
>>>>>>>>>>>>>>>>>>>> of Moratuwa Sri Lanka. I successfully completed my GSoC 2013 with Apache
>>>>>>>>>>>>>>>>>>>> ISIS [1] project. I'm very much interested in OCR and image processing
>>>>>>>>>>>>>>>>>>>> stuff. So I would like to select this project idea as my GSoC 2014 project
>>>>>>>>>>>>>>>>>>>> because I feel like it is the best suited project for me. In university
>>>>>>>>>>>>>>>>>>>> also we have done some research in OCR area and our group wrote a
>>>>>>>>>>>>>>>>>>>> literature review about increasing efficiency of OCR systems(attached). Can
>>>>>>>>>>>>>>>>>>>> you please suggest me where to start learning about PDFBox?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> [1] http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka
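
Three points from the thread can be illustrated with a small Java sketch: the pluggable OCREngine interface suggested for PDFToText, doing the BufferedImage to byte array conversion on the Java side before anything crosses into JNI, and the 64-bit pointer pitfall. This is only a sketch under assumptions, not PDFBox or Tesseract API. Apart from the name OCREngine, which comes from the thread, every identifier here (OcrSketch, recognize, DummyOCREngine, toGrayBytes, survivesIntCast) is hypothetical. The last method shows why a native pointer is safer carried as a Java long (jlong) than as an int (jint): the narrowing cast drops the upper 32 bits, so the value that comes back no longer identifies the original address.

```java
import java.awt.image.BufferedImage;

final class OcrSketch {

    /** Hypothetical plug-in point; the thread only gives the name "OCREngine". */
    interface OCREngine {
        /** Recognises text in an 8-bit grayscale image stored as row-major bytes. */
        String recognize(byte[] grayPixels, int width, int height);
    }

    /** Stand-in engine so the sketch runs without any native library. */
    static final class DummyOCREngine implements OCREngine {
        @Override
        public String recognize(byte[] grayPixels, int width, int height) {
            // A real engine would hand the bytes to native code here.
            return width + "x" + height + " image, " + grayPixels.length + " bytes";
        }
    }

    /** Converts an image to row-major 8-bit gray bytes entirely in Java. */
    static byte[] toGrayBytes(BufferedImage source) {
        int width = source.getWidth();
        int height = source.getHeight();
        byte[] out = new byte[width * height];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = source.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of the usual luminance weights.
                out[y * width + x] = (byte) ((r * 299 + g * 587 + b * 114) / 1000);
            }
        }
        return out;
    }

    /** True only if the pointer value is unchanged after a round trip through int. */
    static boolean survivesIntCast(long nativePointer) {
        return (long) (int) nativePointer == nativePointer;
    }

    public static void main(String[] args) {
        BufferedImage page = new BufferedImage(3, 2, BufferedImage.TYPE_INT_RGB);
        page.setRGB(0, 0, 0xFFFFFF); // one white pixel, the rest stay black
        byte[] pixels = toGrayBytes(page);

        OCREngine engine = new DummyOCREngine();
        System.out.println(engine.recognize(pixels, page.getWidth(), page.getHeight()));

        // An address above 4 GiB, as is common on 64-bit systems.
        System.out.println(survivesIntCast(0x00007F3A12345678L));
    }
}
```

With this shape, PDFToText would depend only on the interface, and a Tesseract-backed implementation could live in its own jar and keep its native handle in a long field.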