Hi John,

Thanks for the guidance. I did a small analysis of the accuracy and performance of the new Tesseract wrapper. I used this image [1] as the input and got the following data [2] after OCR. The first line is the recognised word, followed by the location details (bounding box) of that word. I think these details are pretty much enough for our task. What remains now is converting the PDF file into an image, as you mentioned. I'm working on that these days.
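To feed the word boxes back into text extraction, they have to be mapped from image pixels to page coordinates. Below is a minimal sketch of that mapping, not code from the wrapper or from PDFBox. It assumes the page was rendered at a known DPI, that the OCR boxes are in image pixels with the origin at the top-left, and that the target is default PDF user space (72 units per inch, origin at the bottom-left). The names OcrBoxMapper and toPageSpace are hypothetical, and page rotation and crop box offsets are ignored.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox or the Tesseract wrapper.
final class OcrBoxMapper {
    /** Default PDF user space has 72 units per inch. */
    private static final double POINTS_PER_INCH = 72.0;

    /**
     * Maps a word box from image pixels (origin top-left, y grows downwards)
     * to PDF user space (origin bottom-left, y grows upwards).
     * Assumes an unrotated page whose crop box starts at (0, 0).
     */
    static Rectangle2D.Double toPageSpace(int left, int top, int right, int bottom,
                                          double dpi, double pageHeightPts) {
        double scale = POINTS_PER_INCH / dpi;      // pixels -> points
        double x = left * scale;
        double width = (right - left) * scale;
        double height = (bottom - top) * scale;
        // Flip the y axis: the lower edge of the box in the image is `bottom`.
        double y = pageHeightPts - bottom * scale;
        return new Rectangle2D.Double(x, y, width, height);
    }

    public static void main(String[] args) {
        // A word box on a US Letter page (792 pt high) rendered at 300 DPI.
        Rectangle2D.Double box = toPageSpace(300, 150, 600, 200, 300.0, 792.0);
        System.out.println(box);
    }
}
```

The resulting rectangle is in the same units as the page, so it could be used when building the position that goes with each recognised word.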

[1] https://www.dropbox.com/s/11wahtonoz08zmn/image4.TIF
[2] https://gist.github.com/DImuthuUpe/9491660

Thanks
Dimuthu

On Mon, Mar 10, 2014 at 2:30 PM, John Hewson <j...@jahewson.com> wrote:

> Dimuthu,
>
>> I finished basic implementation of JNI wrapper for Tesseract. Now it can be
>> built using maven. Some useful methods that are needed to do basic OCR were
>> implemented.
>
> Great, it's looking good, nice and clean.
>
>> 1. What is the task of processStream method in PDFTextStripper class line
>> 456 : processStream( page.findResources(), content, page.findCropBox(),
>> page.findRotation() );
>
> A PDF file is made up of pages, each of which contains a "content stream".
> This content stream contains a list of drawing commands such as "move to
> 10,15" or "write the word `foo`"; these are called operators. The
> processStream function reads the stream for the current page and executes
> each of the operators. The operators themselves are each implemented in their
> own class, which is a subclass of PDFOperator. The constructor of
> PDFStreamEngine creates the operator classes using reflection, which is
> rather odd and I'm not sure why this design was chosen. The operators used by
> PDFTextStripper can be found in
> org/apache/pdfbox/resources/PDFTextStripper.properties
>
>> 2. Say I need to extract images and its metadata from a pdf. What is the
>> better approach to do it?
>
> You could subclass PDFTextStripper and override the startDocument method and
> use it to create a PDFRenderer and store it in a field. Then override the
> processPage method and use the previously created PDFRenderer to render the
> current page to a buffered image and perform OCR on the image. Once you have
> the OCR text + positions, instead of calling processStream you can call
> processTextPosition once for each character + position.
>
> The PDFRenderer class was just added to the trunk, so make sure you do an
> "svn update".
> Let me know if you need me to change PDFTextStripper to make it
> easier to subclass.
>
> Cheers
>
> -- John
>
> On 9 Mar 2014, at 09:08, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>
>> Hi John,
>> I finished basic implementation of JNI wrapper for Tesseract. Now it can be
>> built using maven. Some useful methods that are needed to do basic OCR were
>> implemented.
>>
>> I went through PDFBox code several times and have a couple of issues that
>> need to be clarified
>>
>> 1. What is the task of processStream method in PDFTextStripper class line
>> 456 : processStream( page.findResources(), content, page.findCropBox(),
>> page.findRotation() );
>>
>> 2. Say I need to extract images and its metadata from a pdf. What is the
>> better approach to do it?
>>
>> Thanks
>> Dimuthu
>>
>>
>> On Fri, Mar 7, 2014 at 9:26 PM, DImuthu Upeksha
>> <dimuthu.upeks...@gmail.com> wrote:
>>
>>> Hi John
>>> I refactored Tesseract JNI code to support maven build. To create the JNI
>>> library I added pre-built static libraries of Tesseract and Leptonica to
>>> resources folder [2]. For now it includes libraries supported for mac. But
>>> we can easily add both windows and linux libraries. After "mvn clean
>>> install", the jar is created under target folder. Now all setting up is
>>> done. What remains is implementing those native methods in tessbaseapi.cpp
>>> [3]. Hope to finish it asap. Please let me know if there is any concern
>>> about project structure.
>>>
>>> [1] https://github.com/DImuthuUpe/Tesseract-API.git
>>> [2] https://github.com/DImuthuUpe/Tesseract-API/tree/master/src/main/resources
>>> [3] https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/main/native/src/tessbaseapi.cpp
>>>
>>> Thanks
>>> Dimuthu
>>>
>>>
>>> On Thu, Mar 6, 2014 at 1:15 AM, John Hewson <j...@jahewson.com> wrote:
>>>
>>>> Dimuthu
>>>>
>>>>> There is a lot of code fragments in current android jni wrapper which
>>>>> use "(jint)somePointer" casting which will create terrible memory leaks
>>>>> in 64 bit environments because pointers are 64 bit. So I believe writing
>>>>> it from the beginning is much better.
>>>>
>>>> That's a classic 64-bit pitfall, well spotted. We definitely need to
>>>> support 64-bit JVMs.
>>>>
>>>>> we can use
>>>>> the static library of Leptonica (I did and it worked nicely). I think it
>>>>> is not an issue to use its static library because both Tesseract and
>>>>> Leptonica are under apache licence.
>>>>
>>>> Sounds good, I found the following in the README:
>>>>
>>>> Leptonica is required. (www.leptonica.com). Tesseract no longer compiles
>>>> without Leptonica.
>>>>
>>>> Which makes sense.
>>>>
>>>> -- John
>>>>
>>>> On 5 Mar 2014, at 09:45, DImuthu Upeksha <dimuthu.upeks...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi John,
>>>>> +1 for your suggestion about converting image <=> byte array at java
>>>>> side. It reduces a lot of complexity. I don't know whether you have
>>>>> noticed or not, jint data type in jni is a 32bit integer type. I noticed
>>>>> it in my Mac but don't know about other operating systems.
>>>>>
>>>>> Leptonica is the image processing library for Tesseract [1]. What
>>>>> Tesseract does is use image processing algorithms in Leptonica to
>>>>> implement its OCR algorithms. This [2] is the responsible .cpp file to
>>>>> create Tesseract API.
>>>>> You can see it includes allheaders.h header file which is the main
>>>>> header file of Leptonica. So I think it is a must to build Leptonica
>>>>> first and link it when we build Tesseract. This is not a big problem if
>>>>> we can use the static library of Leptonica (I did and it worked nicely).
>>>>> I think it is not an issue to use its static library because both
>>>>> Tesseract and Leptonica are under apache licence.
>>>>>
>>>>> I'm working on the maven implementation you have mentioned and will get
>>>>> back to you soon.
>>>>>
>>>>> Thanks
>>>>> Dimuthu
>>>>>
>>>>>
>>>>> [1] https://code.google.com/p/tesseract-ocr/wiki/Compiling
>>>>> [2] https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/src/api/tesseractmain.cpp
>>>>>
>>>>>
>>>>> On Wed, Mar 5, 2014 at 1:15 AM, John Hewson <j...@jahewson.com> wrote:
>>>>>
>>>>>> Hi Dimuthu,
>>>>>>
>>>>>> 1,2,3:
>>>>>>
>>>>>> Feel free to write your own Tesseract binding or port the existing code
>>>>>> as you see fit. The JNI binding should be minimal, only the methods you
>>>>>> require need to be wrapped. Also, don't forget that some of the interop
>>>>>> can be done in Java, for example if it is easier to convert a
>>>>>> BufferedImage to a byte array in Java then do it there and pass the
>>>>>> result to JNI rather than writing lots of JNI C++ to achieve the same
>>>>>> result.
>>>>>>
>>>>>> Your GitHub repo looks like a good start, I can make comments there as
>>>>>> things progress.
>>>>>>
>>>>>> Is it possible to build Tesseract without leptonica? I was under the
>>>>>> impression that it was used for image i/o only, but I may be
>>>>>> misinformed.
>>>>>>
>>>>>> 4: The native platform library should be built as part of the Maven
>>>>>> build for the Tesseract wrapper which can be a separate project. The
>>>>>> output can be a jar file which contains the native binaries.
>>>>>> It should be possible for the jar to contain prebuilt binaries for all
>>>>>> platforms but this is something we can worry about later. Right now the
>>>>>> goal should be to build a jar containing just the current platform's
>>>>>> native binary and any Java wrapper code.
>>>>>>
>>>>>> -- John
>>>>>>
>>>>>> On 3 Mar 2014, at 16:41, DImuthu Upeksha <dimuthu.upeks...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi John,
>>>>>>>
>>>>>>> I tried to reuse that android jni wrapper for tesseract. Here is my
>>>>>>> observation
>>>>>>>
>>>>>>> 1. This wrapper heavily depends on android image libraries
>>>>>>> (android/bitmap.h). Most of the wrapper methods [1] use this library.
>>>>>>>
>>>>>>> 2. But I can understand underlying logic in each function. Basically
>>>>>>> what it does is mapping between tesseract api functions [2] and java
>>>>>>> methods. In between it does some image <=> byte array conversions
>>>>>>> using the bitmap libraries in Android
>>>>>>>
>>>>>>> 3. There are two ways. 1: We can port its code to make it compatible
>>>>>>> with our environments (linux, windows and mac) which is really
>>>>>>> painful. Also it will cause memory leaks. 2: We can use only its
>>>>>>> function signatures and implement using our own code
>>>>>>>
>>>>>>> I think 2nd solution is better because we need only few operations to
>>>>>>> be done using tesseract library. I have created a github repo [3] for
>>>>>>> this. It's still not finished. I need to add some make files and build
>>>>>>> files to make it run properly. And also I need to implement those
>>>>>>> wrapper functions [4]. This may take some time.
>>>>>>>
>>>>>>> 4. Because we are calling native libraries we need different builds of
>>>>>>> tesseract and leptonica libraries for each platform (dll for windows,
>>>>>>> so for linux, dylib for mac).
>>>>>>> So we may need to build those libraries at the time we build pdfbox
>>>>>>> project. Or we can pre build those libraries and add them to the
>>>>>>> project as .dll, .so or .dylib format. What is the preferred way?
>>>>>>>
>>>>>>> [1] https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/tessbaseapi.cpp
>>>>>>> [2] https://code.google.com/p/tesseract-ocr/wiki/APIExample
>>>>>>> [3] https://github.com/DImuthuUpe/Tesseract-API
>>>>>>> [4] https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/tessbaseapi.cpp
>>>>>>>
>>>>>>> Thanks
>>>>>>> Dimuthu
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Mar 1, 2014 at 11:39 PM, DImuthu Upeksha
>>>>>>> <dimuthu.upeks...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I updated necessary changes to the document [1]
>>>>>>>>
>>>>>>>> For last two days I had a deep look at this [2] jni wrapper for
>>>>>>>> tesseract api.
>>>>>>>> Unfortunately this has been designed for Android environment so I
>>>>>>>> think we need to write our own make files to build this in to a dll
>>>>>>>> (windows) or dylib (in mac). Currently it has Android.mk files [3].
>>>>>>>> I'm searching for a way to convert it to a make file that we can run
>>>>>>>> on console.
>>>>>>>> Please suggest if you have a better approach
>>>>>>>>
>>>>>>>> [1] https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf
>>>>>>>> [2] https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/
>>>>>>>> [3] https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <j...@jahewson.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> This is a good start. However, there is no need for the Adder
>>>>>>>>> component, "Extracted Text (OCR)" can just feed back into the PDFBox
>>>>>>>>> "Text Extractor".
>>>>>>>>>
>>>>>>>>> Maybe show a "PDF" file feeding in to "Text Extractor", to make it
>>>>>>>>> clear where the process starts.
>>>>>>>>>
>>>>>>>>> -- John
>>>>>>>>>
>>>>>>>>> On 26 Feb 2014, at 16:53, DImuthu Upeksha
>>>>>>>>> <dimuthu.upeks...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Sorry for the mistake. I added it to my Dropbox [1].
>>>>>>>>>>
>>>>>>>>>> [1] https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Dimuthu
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <j...@jahewson.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I should add that the OCR engine should be pluggable so PDFToText
>>>>>>>>>>> might use an interface, e.g. OCREngine and there will be a
>>>>>>>>>>> TesseractOCREngine class somewhere which provides the required
>>>>>>>>>>> functionality and lives in a separate jar file.
>>>>>>>>>>>
>>>>>>>>>>> -- John
>>>>>>>>>>>
>>>>>>>>>>>> On 25 Feb 2014, at 20:18, Dimuthu <dimuthu.upeks...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> So do you need to embed those new functionalities into existing
>>>>>>>>>>>> PDFtoText algorithms or package them as a new sub system
>>>>>>>>>>>> (something like an API)?
>>>>>>>>>>>>
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: "John Hewson" <j...@jahewson.com>
>>>>>>>>>>>> Sent: 26/02/2014 07:38
>>>>>>>>>>>> To: "dev@pdfbox.apache.org" <dev@pdfbox.apache.org>
>>>>>>>>>>>> Subject: Re: [GSoC 2014]Optical Character Recognition project -
>>>>>>>>>>>> Introduction
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, exactly. By location data I just mean (x,y) coordinates and
>>>>>>>>>>>> page rotation.
>>>>>>>>>>>>
>>>>>>>>>>>> There is another use case for OCR: some fonts embedded in PDFs
>>>>>>>>>>>> have corrupt encodings, which means the ASCII codes map to the
>>>>>>>>>>>> wrong glyphs. We could OCR the glyphs to repair the encoding.
>>>>>>>>>>>>
>>>>>>>>>>>> -- John
>>>>>>>>>>>>
>>>>>>>>>>>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha
>>>>>>>>>>>>> <dimuthu.upeks...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>> Thanks for the explanation.
>>>>>>>>>>>>> Let's say there is a pdf with both text in extractable format
>>>>>>>>>>>>> and some images with text (scanned images). In that case first
>>>>>>>>>>>>> we extract the extractable content using PDFBox algorithms and
>>>>>>>>>>>>> the rest is extracted using OCR. Finally we pack both results
>>>>>>>>>>>>> together and give output as PDFToText. Am I correct? What do you
>>>>>>>>>>>>> mean by "location data"?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson
>>>>>>>>>>>>>> <j...@jahewson.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1. What is called "glyphs" ?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> http://en.wikipedia.org/wiki/Glyph
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2. What is the main requirement of this project?
>>>>>>>>>>>>>>> As far as I understood, first we need to generate an image of
>>>>>>>>>>>>>>> malformed pdfs from PDFBox and then we need to do processing
>>>>>>>>>>>>>>> using OCR for further accurate results. But the problem is,
>>>>>>>>>>>>>>> why shouldn't we directly do OCR on those PDFs without getting
>>>>>>>>>>>>>>> output from PDFBox? Correct me if I'm wrong.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> PDFBox can generate images (PDFToImage) and can extract text
>>>>>>>>>>>>>> (PDFToText). The goal of this project is to enhance PDFToText
>>>>>>>>>>>>>> so that it can use OCR to extract text from areas of the
>>>>>>>>>>>>>> document where the text is embedded as an image. Such PDF files
>>>>>>>>>>>>>> are typically generated by scanners or fax machines. There is
>>>>>>>>>>>>>> also another case where OCR is useful: some fonts embedded in
>>>>>>>>>>>>>> PDF files contain the wrong encoding, so when text is extracted
>>>>>>>>>>>>>> with PDFToText the result is nonsense but when drawn with
>>>>>>>>>>>>>> PDFToImage we see the correct letters.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Instead of:
>>>>>>>>>>>>>> PDF => Image => OCR => Text
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We want to do:
>>>>>>>>>>>>>> PDF => (Many images for words + location data => OCR) => Text
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha
>>>>>>>>>>>>>>> <dimuthu.upeks...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Ok fixed.
>>>>>>>>>>>>>>>> This is what I did
>>>>>>>>>>>>>>>> Right click on the new project -> Debug As -> Debug
>>>>>>>>>>>>>>>> Configurations -> Source -> Add -> Project
>>>>>>>>>>>>>>>> Then I selected PDFBox project.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha
>>>>>>>>>>>>>>>> <dimuthu.upeks...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm using eclipse. This is what I want. I created a new Java
>>>>>>>>>>>>>>>>> application project (say TestPDFBox) with a main class with
>>>>>>>>>>>>>>>>> following code.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> PDDocument document = new PDDocument();
>>>>>>>>>>>>>>>>> PDPage blankPage = new PDPage();
>>>>>>>>>>>>>>>>> document.addPage( blankPage );
>>>>>>>>>>>>>>>>> document.save("BlankPage.pdf");
>>>>>>>>>>>>>>>>> document.close();
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Then I need to add those jar files generated in target
>>>>>>>>>>>>>>>>> folder of PDFBox to build path of my new project (I did
>>>>>>>>>>>>>>>>> build the PDFBox project from source). That is what I did.
>>>>>>>>>>>>>>>>> But let's say I need to check the functionality of
>>>>>>>>>>>>>>>>> document.save("") method. But I don't have a reference to
>>>>>>>>>>>>>>>>> it's sources because I directly used generated jars. As
>>>>>>>>>>>>>>>>> Tilman said I built PDFBox from sources but I don't know a
>>>>>>>>>>>>>>>>> proper way to use it other projects other than adding those
>>>>>>>>>>>>>>>>> jar files to build path.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson
>>>>>>>>>>>>>>>>> <j...@jahewson.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Which IDE are you using?
>>>>>>>>>>>>>>>>>> You should be able to run the PDFToText class (in
>>>>>>>>>>>>>>>>>> pdfbox-tools) using your IDE and pass a PDF file path as
>>>>>>>>>>>>>>>>>> the command line argument.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha
>>>>>>>>>>>>>>>>>>> <dimuthu.upeks...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>>> Thanks for the reply. Yes I checked out PDFBox code and
>>>>>>>>>>>>>>>>>>> managed to build the code successfully. I looked at the
>>>>>>>>>>>>>>>>>>> classes you mentioned and I got a rough idea about how
>>>>>>>>>>>>>>>>>>> they are working. To check them I used the jars in target
>>>>>>>>>>>>>>>>>>> folder in my separate java project. I tried samples in
>>>>>>>>>>>>>>>>>>> http://pdfbox.apache.org/cookbook/. I need to further look
>>>>>>>>>>>>>>>>>>> into code, especially how those processXXX() methods work
>>>>>>>>>>>>>>>>>>> in PDFTextStripper class. What I usually do is adding some
>>>>>>>>>>>>>>>>>>> breakpoints and checking them in debug windows. But using
>>>>>>>>>>>>>>>>>>> jars it's not possible. What is the way you follow in
>>>>>>>>>>>>>>>>>>> order to do such a task?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I also installed tesseract on my machine and managed to do
>>>>>>>>>>>>>>>>>>> some OCR stuff. That's a cool tool which works fine.
>>>>>>>>>>>>>>>>>>> I'm still learning the code. If I get any issue I'll drop
>>>>>>>>>>>>>>>>>>> you a mail.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson
>>>>>>>>>>>>>>>>>>>> <j...@jahewson.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Dimuthu
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The PDFBox website can be found at
>>>>>>>>>>>>>>>>>>>> http://pdfbox.apache.org/ and it contains a basic
>>>>>>>>>>>>>>>>>>>> overview of the project and details on how to obtain the
>>>>>>>>>>>>>>>>>>>> source code and build PDFBox for yourself.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912
>>>>>>>>>>>>>>>>>>>> details the only thoughts so far regarding it. Note that
>>>>>>>>>>>>>>>>>>>> the OCR libraries mentioned in the JIRA issue are all
>>>>>>>>>>>>>>>>>>>> under the Apache license, which is a requirement.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Once you have the source code, take a look at the
>>>>>>>>>>>>>>>>>>>> PageDrawer class to see how text and images are rendered.
>>>>>>>>>>>>>>>>>>>> We want someone to interface at a low-level (e.g. one
>>>>>>>>>>>>>>>>>>>> glyph, word, or sentence at a time) with an OCR engine.
>>>>>>>>>>>>>>>>>>>> Also look at PDFTextStripper which is how text is
>>>>>>>>>>>>>>>>>>>> currently extracted, take a look at how we have to go to
>>>>>>>>>>>>>>>>>>>> great lengths to sort text back into reading order and
>>>>>>>>>>>>>>>>>>>> infer the placement of diacritics - PDF is fundamentally
>>>>>>>>>>>>>>>>>>>> a visual format, not a structured format like HTML -
>>>>>>>>>>>>>>>>>>>> which is why extracting text can be so difficult
>>>>>>>>>>>>>>>>>>>> sometimes.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The full PDF Reference document can be found at:
>>>>>>>>>>>>>>>>>>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Feel free to discuss specifics of your proposal or ask
>>>>>>>>>>>>>>>>>>>> any questions.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha
>>>>>>>>>>>>>>>>>>>> <dimuthu.upeks...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering
>>>>>>>>>>>>>>>>>>>>> Undergraduate at University of Moratuwa, Sri Lanka. I
>>>>>>>>>>>>>>>>>>>>> successfully completed my GSoC 2013 with Apache ISIS [1]
>>>>>>>>>>>>>>>>>>>>> project. I'm very much interested in OCR and image
>>>>>>>>>>>>>>>>>>>>> processing stuff. So I would like to select this project
>>>>>>>>>>>>>>>>>>>>> idea as my GSoC 2014 project because I feel like it is
>>>>>>>>>>>>>>>>>>>>> the best suited project for me.
>>>>>>>>>>>>>>>>>>>>> In university also we have done some research in OCR
>>>>>>>>>>>>>>>>>>>>> area and our group wrote a literature review about
>>>>>>>>>>>>>>>>>>>>> increasing efficiency of OCR systems (attached). Can you
>>>>>>>>>>>>>>>>>>>>> please suggest me where to start learning about PDFBox?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> [1] http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka

--
Regards
W.Dimuthu Upeksha
Undergraduate
Department of Computer Science And Engineering
University of Moratuwa, Sri Lanka