Dimuthu,

Your new diagram looks good. The JNI wrapper for Tesseract is indeed for Android, so it will need porting to a standard desktop C++ environment. We use Maven to build PDFBox, and there is a native-maven-plugin which can build JNI projects, see:
http://docs.codehaus.org/display/MAVENUSER/Projects+With+JNI

The plugin itself is here:
http://mojo.codehaus.org/maven-native/native-maven-plugin/

If you've not used Maven before, it's a Java build system with its own package repository (like RubyGems or npm), so you just write an XML file and it downloads the appropriate plugins at build time as they are required.

What operating system do you develop on? I'm on OS X, but I have VMs for most platforms.

Thanks

-- John

On 1 Mar 2014, at 10:09, DImuthu Upeksha <[email protected]> wrote:

> I made the necessary changes to the document [1].
>
> For the last two days I had a deep look at this [2] JNI wrapper for the
> Tesseract API. Unfortunately it has been designed for the Android
> environment, so I think we need to write our own makefiles to build it
> into a DLL (Windows) or dylib (Mac). Currently it has Android.mk files [3].
> I'm searching for a way to convert them to a makefile that we can run on
> the console. Please suggest if you have a better approach.
>
> [1] https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf
> [2] https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/
> [3] https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk
>
> On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <[email protected]> wrote:
>
>> This is a good start. However, there is no need for the Adder component,
>> "Extracted Text (OCR)" can just feed back into the PDFBox "Text Extractor".
>>
>> Maybe show a "PDF" file feeding in to "Text Extractor", to make it clear
>> where the process starts.
>>
>> -- John
>>
>> On 26 Feb 2014, at 16:53, DImuthu Upeksha <[email protected]> wrote:
>>
>>> Sorry for the mistake. I added it to my Dropbox [1].
>>>
>>> [1] https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
>>>
>>> Thanks
>>> Dimuthu
>>>
>>> On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <[email protected]> wrote:
>>>
>>>> I should add that the OCR engine should be pluggable, so PDFToText might
>>>> use an interface, e.g. OCREngine, and there will be a TesseractOCREngine
>>>> class somewhere which provides the required functionality and lives in a
>>>> separate jar file.
>>>>
>>>> -- John
>>>>
>>>>> On 25 Feb 2014, at 20:18, Dimuthu <[email protected]> wrote:
>>>>>
>>>>> So do you need to embed those new functionalities into the existing
>>>>> PDFToText algorithms or package them as a new subsystem (something
>>>>> like an API)?
>>>>>
>>>>> -----Original Message-----
>>>>> From: "John Hewson" <[email protected]>
>>>>> Sent: 26/02/2014 07:38
>>>>> To: "[email protected]" <[email protected]>
>>>>> Subject: Re: [GSoC 2014]Optical Character Recognition project - Introduction
>>>>>
>>>>> Yes, exactly. By location data I just mean (x,y) coordinates and page
>>>>> rotation.
>>>>>
>>>>> There is another use case for OCR: some fonts embedded in PDFs have
>>>>> corrupt encodings, which means the ASCII codes map to the wrong glyphs.
>>>>> We could OCR the glyphs to repair the encoding.
>>>>>
>>>>> -- John
>>>>>
>>>>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <[email protected]> wrote:
>>>>>>
>>>>>> Hi John,
>>>>>> Thanks for the explanation.
>>>>>> Let's say there is a PDF with both text in an extractable format and
>>>>>> some images with text (scanned images). In that case, first we extract
>>>>>> the extractable content using PDFBox algorithms and the rest is
>>>>>> extracted using OCR. Finally we pack both results together and give
>>>>>> the output as PDFToText. Am I correct? What do you mean by "location
>>>>>> data"?
>>>>>>
>>>>>> Thanks
>>>>>> Dimuthu
>>>>>>
>>>>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <[email protected]> wrote:
>>>>>>>
>>>>>>> 1. What is called "glyphs"?
>>>>>>>
>>>>>>> http://en.wikipedia.org/wiki/Glyph
>>>>>>>
>>>>>>>> 2. What is the main requirement of this project?
>>>>>>>> As far as I understood, first we need to generate an image of
>>>>>>>> malformed PDFs from PDFBox and then we need to do processing using
>>>>>>>> OCR for more accurate results. But the problem is, why shouldn't we
>>>>>>>> directly do OCR on those PDFs without getting output from PDFBox?
>>>>>>>> Correct me if I'm wrong.
>>>>>>>
>>>>>>> PDFBox can generate images (PDFToImage) and can extract text
>>>>>>> (PDFToText). The goal of this project is to enhance PDFToText so that
>>>>>>> it can use OCR to extract text from areas of the document where the
>>>>>>> text is embedded as an image. Such PDF files are typically generated
>>>>>>> by scanners or fax machines. There is also another case where OCR is
>>>>>>> useful: some fonts embedded in PDF files contain the wrong encoding,
>>>>>>> so when text is extracted with PDFToText the result is nonsense, but
>>>>>>> when drawn with PDFToImage we see the correct letters.
>>>>>>>
>>>>>>> Instead of:
>>>>>>> PDF => Image => OCR => Text
>>>>>>>
>>>>>>> We want to do:
>>>>>>> PDF => (Many images for words + location data => OCR) => Text
>>>>>>>
>>>>>>> -- John
>>>>>>>
>>>>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Ok, fixed. This is what I did:
>>>>>>>>> Right click on the new project -> Debug As -> Debug Configurations
>>>>>>>>> -> Source -> Add -> Project
>>>>>>>>> Then I selected the PDFBox project.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Dimuthu
>>>>>>>>>
>>>>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I'm using Eclipse. This is what I want. I created a new Java
>>>>>>>>>> application project (say TestPDFBox) with a main class containing
>>>>>>>>>> the following code.
>>>>>>>>>>
>>>>>>>>>> PDDocument document = new PDDocument();
>>>>>>>>>> PDPage blankPage = new PDPage();
>>>>>>>>>> document.addPage(blankPage);
>>>>>>>>>> document.save("BlankPage.pdf");
>>>>>>>>>> document.close();
>>>>>>>>>>
>>>>>>>>>> Then I need to add the jar files generated in the target folder of
>>>>>>>>>> PDFBox to the build path of my new project (I did build the PDFBox
>>>>>>>>>> project from source). That is what I did. But let's say I need to
>>>>>>>>>> check the functionality of the document.save("") method. I don't
>>>>>>>>>> have a reference to its sources because I directly used the
>>>>>>>>>> generated jars. As Tilman said, I built PDFBox from sources, but I
>>>>>>>>>> don't know a proper way to use it in other projects other than
>>>>>>>>>> adding those jar files to the build path.
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Which IDE are you using? You should be able to run the PDFToText
>>>>>>>>>>> class (in pdfbox-tools) using your IDE and pass a PDF file path as
>>>>>>>>>>> the command line argument.
>>>>>>>>>>>
>>>>>>>>>>> -- John
>>>>>>>>>>>
>>>>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi John,
>>>>>>>>>>>> Thanks for the reply. Yes, I checked out the PDFBox code and
>>>>>>>>>>>> managed to build it successfully. I looked at the classes you
>>>>>>>>>>>> mentioned and I got a rough idea of how they work.
>>>>>>>>>>>> To check them, I added the jars from the target folder to my
>>>>>>>>>>>> separate Java project. I tried the samples at
>>>>>>>>>>>> http://pdfbox.apache.org/cookbook/. I need to look further into
>>>>>>>>>>>> the code, especially how the processXXX() methods work in the
>>>>>>>>>>>> PDFTextStripper class. What I usually do is add some breakpoints
>>>>>>>>>>>> and check them in the debug windows, but with jars that's not
>>>>>>>>>>>> possible. What approach do you follow for such a task?
>>>>>>>>>>>>
>>>>>>>>>>>> I also installed Tesseract on my machine and managed to do some
>>>>>>>>>>>> OCR with it. That's a cool tool which works fine.
>>>>>>>>>>>> I'm still learning the code. If I run into any issue I'll drop you
>>>>>>>>>>>> a mail.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Dimuthu
>>>>>>>>>>>>>
>>>>>>>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/ and
>>>>>>>>>>>>> contains a basic overview of the project and details on how to
>>>>>>>>>>>>> obtain the source code and build PDFBox for yourself.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912 details the
>>>>>>>>>>>>> only thoughts so far regarding it. Note that the OCR libraries
>>>>>>>>>>>>> mentioned in the JIRA issue are all under the Apache license,
>>>>>>>>>>>>> which is a requirement.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Once you have the source code, take a look at the PageDrawer
>>>>>>>>>>>>> class to see how text and images are rendered. We want someone to
>>>>>>>>>>>>> interface at a low level (e.g. one glyph, word, or sentence at a
>>>>>>>>>>>>> time) with an OCR engine.
>>>>>>>>>>>>> Also look at PDFTextStripper, which is how text is currently
>>>>>>>>>>>>> extracted, and take a look at how we have to go to great lengths
>>>>>>>>>>>>> to sort text back into reading order and infer the placement of
>>>>>>>>>>>>> diacritics. PDF is fundamentally a visual format, not a
>>>>>>>>>>>>> structured format like HTML, which is why extracting text can be
>>>>>>>>>>>>> so difficult sometimes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The full PDF Reference document can be found at:
>>>>>>>>>>>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>>>>>>>>>>>>
>>>>>>>>>>>>> Feel free to discuss specifics of your proposal or ask any
>>>>>>>>>>>>> questions.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering undergraduate at the
>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka. I successfully completed GSoC
>>>>>>>>>>>>>> 2013 with the Apache ISIS [1] project. I'm very much interested
>>>>>>>>>>>>>> in OCR and image processing, so I would like to select this
>>>>>>>>>>>>>> project idea as my GSoC 2014 project because I feel it is the
>>>>>>>>>>>>>> best-suited project for me. At university we have also done some
>>>>>>>>>>>>>> research in the OCR area, and our group wrote a literature review
>>>>>>>>>>>>>> about increasing the efficiency of OCR systems (attached). Can
>>>>>>>>>>>>>> you please suggest where to start learning about PDFBox?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1] http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka
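
The pluggable OCR engine idea from the thread (PDFToText using an interface such as OCREngine, with a TesseractOCREngine living in a separate jar) might be sketched roughly as below. This is only an illustration under stated assumptions: apart from the OCREngine name, every type, method and field here is hypothetical and not part of PDFBox, and a fixed-text stub stands in for a real Tesseract-backed engine. The "location data" is modelled as the (x, y) coordinates and page rotation mentioned in the thread.

```java
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical plug-in point named in the thread; not an existing PDFBox API. */
interface OCREngine {
    /** Returns the text recognized in the given image. */
    String recognize(BufferedImage image);
}

/** Hypothetical stand-in for a real engine (e.g. a TesseractOCREngine in its own jar). */
final class FixedTextOCREngine implements OCREngine {
    private final String text;

    FixedTextOCREngine(String text) {
        this.text = text;
    }

    @Override
    public String recognize(BufferedImage image) {
        return text; // a real engine would analyse the pixels here
    }
}

/** Hypothetical image region plus its location data: (x, y) and page rotation. */
final class ImageRegion {
    final BufferedImage image;
    final float x;
    final float y;
    final int rotation;

    ImageRegion(BufferedImage image, float x, float y, int rotation) {
        this.image = image;
        this.x = x;
        this.y = y;
        this.rotation = rotation;
    }
}

/** Hypothetical OCR output that keeps the location data next to the text. */
final class OcrResult {
    final String text;
    final float x;
    final float y;
    final int rotation;

    OcrResult(String text, float x, float y, int rotation) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.rotation = rotation;
    }
}

/** Runs whichever engine was plugged in over each image region. */
final class OcrTextCollector {
    private final OCREngine engine;

    OcrTextCollector(OCREngine engine) {
        this.engine = engine;
    }

    List<OcrResult> recognizeAll(List<ImageRegion> regions) {
        List<OcrResult> results = new ArrayList<>();
        for (ImageRegion region : regions) {
            String text = engine.recognize(region.image);
            results.add(new OcrResult(text, region.x, region.y, region.rotation));
        }
        return results;
    }
}
```

Because the collector only depends on the interface, swapping the stub for a JNI-backed engine would not require changes to the calling code; how the results are merged back into the existing text extraction is left open here.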
