Hi John,
For now I'm using those methods to debug the wrapper. I'll remove them once I have finished testing it.
I have started implementing an OCR plugin [1] for PDFBox. It currently satisfies the basic requirements, such as getting word + location data [2]. Please have a look and let me know if any changes are required.

[1] https://github.com/DImuthuUpe/OCR-Plugin
[2] https://github.com/DImuthuUpe/OCR-Plugin/blob/master/src/main/java/org/apache/pdfbox/ocr/OCRConnector.java

Thanks
Dimuthu

On Fri, Mar 14, 2014 at 12:09 AM, John Hewson <j...@jahewson.com> wrote:
> Thanks, I saw your new refactoring too, it's good. Now the following methods
> are no longer needed:
>
> public void setImagePath(String path)
> public void setImage(byte[] imagedata, int width, int height, int bpp, int bpl)
>
> Cheers
>
> -- John
>
> On 11 Mar 2014, at 22:58, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>
>> Hi John,
>> Yes. I implemented a new method to accept the image as a byte stream.
>> We can't send BufferedImage objects directly to the native side, so I
>> convert the BufferedImage into a byte array and pass that to the native
>> side, where it is converted again into a compatible format. Along with
>> the bytes we need to pass some metadata about the byte stream: image
>> width, height, bytes per pixel and bytes per row. I checked it with
>> this test case [2] and it works fine.
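
The conversion described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in the Tesseract-API repository: the class name ImageBytes and the choice of tightly packed 8-bit greyscale are hypothetical, and the only thing taken from the thread is the set of values handed over (byte array, width, height, bytes per pixel, bytes per row).

```java
import java.awt.image.BufferedImage;

/** Hypothetical holder for raw pixels plus the metadata the native side needs. */
final class ImageBytes {
    final byte[] data;
    final int width;
    final int height;
    final int bytesPerPixel;
    final int bytesPerLine;

    private ImageBytes(byte[] data, int width, int height, int bytesPerPixel, int bytesPerLine) {
        this.data = data;
        this.width = width;
        this.height = height;
        this.bytesPerPixel = bytesPerPixel;
        this.bytesPerLine = bytesPerLine;
    }

    /** Converts any BufferedImage to packed 8-bit greyscale (an assumed layout). */
    static ImageBytes fromBufferedImage(BufferedImage image) {
        int width = image.getWidth();
        int height = image.getHeight();
        int bytesPerPixel = 1;
        int bytesPerLine = width * bytesPerPixel;
        byte[] data = new byte[bytesPerLine * height];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = image.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of the usual luma weights.
                data[y * bytesPerLine + x] = (byte) ((r * 299 + g * 587 + b * 114) / 1000);
            }
        }
        return new ImageBytes(data, width, height, bytesPerPixel, bytesPerLine);
    }

    public static void main(String[] args) {
        BufferedImage image = new BufferedImage(2, 2, BufferedImage.TYPE_INT_RGB);
        image.setRGB(0, 0, 0xFFFFFF); // one white pixel, the rest stay black
        ImageBytes bytes = fromBufferedImage(image);
        System.out.println(bytes.width + "x" + bytes.height
                + " bpp=" + bytes.bytesPerPixel + " bpl=" + bytes.bytesPerLine);
        // These values match the parameter list quoted in this thread:
        // setImage(byte[] imagedata, int width, int height, int bpp, int bpl)
    }
}
```

Doing the pixel conversion on the Java side keeps the JNI layer down to a single call that receives a plain byte array and four integers.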
>>
>> [1]
>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/main/java/com/apache/pdfbox/ocr/tesseract/TessBaseAPI.java#L74
>> [2]
>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/test/java/com/apache/pdfbox/ocr/tesseract/TessByteSteamTest.java
>>
>> Thanks
>> Dimuthu
>>
>> On Wed, Mar 12, 2014 at 12:40 AM, John Hewson <j...@jahewson.com> wrote:
>>> Hi Dimuthu
>>>
>>> The Tesseract wrapper needs to take its input from a BufferedImage rather
>>> than reading a file from disk, so instead of:
>>>
>>> api.setImagePath("test.tif");
>>>
>>> What we need is:
>>>
>>> BufferedImage image = ImageIO.read(new File("test.tif"));
>>> api.setImagePath(image);
>>>
>>> Because this will let us use the BufferedImage generated by PDFRenderer
>>> without round-tripping to the disk.
>>>
>>> -- John
>>>
>>> On 11 Mar 2014, at 11:13, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>>>
>>>> Hi John,
>>>> Thanks for the guidance.
>>>> I did a small analysis of the accuracy and performance of the new
>>>> Tesseract wrapper. I used this image [1] as the input and got the
>>>> following data [2] after OCR. The first line is the recognised word,
>>>> followed by the location details (bounding box) of the word. I think
>>>> these details are enough for our task. What remains is converting
>>>> the PDF file into an image, as you mentioned. I'm working on that now.
>>>>
>>>> [1] https://www.dropbox.com/s/11wahtonoz08zmn/image4.TIF
>>>> [2] https://gist.github.com/DImuthuUpe/9491660
>>>>
>>>> Thanks
>>>> Dimuthu
>>>>
>>>> On Mon, Mar 10, 2014 at 2:30 PM, John Hewson <j...@jahewson.com> wrote:
>>>>> Dimuthu,
>>>>>
>>>>>> I finished basic implementation of JNI wrapper for Tesseract. Now it can
>>>>>> be built using maven. Some useful methods that are needed to do basic OCR
>>>>>> were implemented.
>>>>>
>>>>> Great, it's looking good, nice and clean.
>>>>>
>>>>>> 1. What is the task of processStream method in PDFTextStripper class line
>>>>>> 456 : processStream( page.findResources(), content, page.findCropBox(),
>>>>>> page.findRotation() );
>>>>>
>>>>> A PDF file is made up of pages, each of which contains a "content
>>>>> stream". This content stream contains a list of drawing commands such as
>>>>> "move to 10,15" or "write the word `foo`"; these are called operators.
>>>>> The processStream function reads the stream for the current page and
>>>>> executes each of the operators. The operators themselves are implemented
>>>>> each in their own class which is a subclass of PDFOperator. The
>>>>> constructor of PDFStreamEngine creates the operator classes using
>>>>> reflection, which is rather odd and I'm not sure why this design was
>>>>> chosen. The operators used by PDFTextStripper can be found in
>>>>> org/apache/pdfbox/resources/PDFTextStripper.properties
>>>>>
>>>>>> 2. Say I need to extract images and their metadata from a pdf. What is
>>>>>> the better approach to do it?
>>>>>
>>>>> You could subclass PDFTextStripper and override the startDocument method
>>>>> and use it to create a PDFRenderer and store it in a field. Then override
>>>>> the processPage method and use the previously created PDFRenderer to
>>>>> render the current page to a buffered image and perform OCR on the image.
>>>>> Once you have the OCR text + positions, instead of calling processStream
>>>>> you can call processTextPosition once for each character + position.
>>>>>
>>>>> The PDFRenderer class was just added to the trunk, so make sure you do an
>>>>> "svn update". Let me know if you need me to change PDFTextStripper to
>>>>> make it easier to subclass.
>>>>>
>>>>> Cheers
>>>>>
>>>>> -- John
>>>>>
>>>>> On 9 Mar 2014, at 09:08, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>>>>>
>>>>>> Hi John,
>>>>>> I finished a basic implementation of the JNI wrapper for Tesseract. Now it can be built using Maven.
>>>>>> Some useful methods needed for basic OCR were implemented.
>>>>>>
>>>>>> I went through the PDFBox code several times and have a couple of
>>>>>> questions that need to be clarified:
>>>>>>
>>>>>> 1. What is the task of the processStream method in the PDFTextStripper class, line
>>>>>> 456 : processStream( page.findResources(), content, page.findCropBox(),
>>>>>> page.findRotation() );
>>>>>>
>>>>>> 2. Say I need to extract images and their metadata from a PDF. What is the
>>>>>> better approach to do it?
>>>>>>
>>>>>> Thanks
>>>>>> Dimuthu
>>>>>>
>>>>>>
>>>>>> On Fri, Mar 7, 2014 at 9:26 PM, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi John
>>>>>>> I refactored the Tesseract JNI code to support a Maven build. To create the
>>>>>>> JNI library I added pre-built static libraries of Tesseract and Leptonica to
>>>>>>> the resources folder [2]. For now it includes only the libraries for Mac,
>>>>>>> but we can easily add Windows and Linux libraries. After "mvn clean
>>>>>>> install", the jar is created under the target folder. All the setting up is
>>>>>>> now done. What remains is implementing the native methods in
>>>>>>> tessbaseapi.cpp [3]. I hope to finish it as soon as possible. Please let me
>>>>>>> know if you have any concerns about the project structure.
>>>>>>>
>>>>>>> [1] https://github.com/DImuthuUpe/Tesseract-API.git
>>>>>>> [2]
>>>>>>> https://github.com/DImuthuUpe/Tesseract-API/tree/master/src/main/resources
>>>>>>> [3]
>>>>>>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/main/native/src/tessbaseapi.cpp
>>>>>>>
>>>>>>> Thanks
>>>>>>> Dimuthu
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Mar 6, 2014 at 1:15 AM, John Hewson <j...@jahewson.com> wrote:
>>>>>>>
>>>>>>>> Dimuthu
>>>>>>>>
>>>>>>>>> There are a lot of code
>>>>>>>>> fragments in the current Android JNI wrapper which use "(jint)somePointer"
>>>>>>>>> casting, which will create terrible memory leaks in 64-bit environments
>>>>>>>>> because pointers are 64 bit. So I believe writing it from the beginning is
>>>>>>>>> much better.
>>>>>>>>
>>>>>>>> That's a classic 64-bit pitfall, well spotted. We definitely need to
>>>>>>>> support 64-bit JVMs.
>>>>>>>>
>>>>>>>>> we can use
>>>>>>>>> the static library of Leptonica (I did and it worked nicely). I think it is
>>>>>>>>> not an issue to use its static library because both Tesseract and Leptonica
>>>>>>>>> are under the Apache licence.
>>>>>>>>
>>>>>>>> Sounds good, I found the following in the README:
>>>>>>>>
>>>>>>>> Leptonica is required. (www.leptonica.com). Tesseract no longer compiles
>>>>>>>> without Leptonica.
>>>>>>>>
>>>>>>>> Which makes sense.
>>>>>>>>
>>>>>>>> -- John
>>>>>>>>
>>>>>>>> On 5 Mar 2014, at 09:45, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi John,
>>>>>>>>> +1 for your suggestion about converting image <=> byte array on the Java side.
>>>>>>>>> It removes a lot of complexity. I don't know whether you have noticed
>>>>>>>>> or not, but the jint data type in JNI is a 32-bit integer type. I noticed it on my
>>>>>>>>> Mac but don't know about other operating systems.
>>>>>>>>>
>>>>>>>>> Leptonica is the image processing library for Tesseract [1]. Tesseract
>>>>>>>>> uses the image processing algorithms in Leptonica to implement its OCR
>>>>>>>>> algorithms. This [2] is the .cpp file responsible for creating the Tesseract API.
>>>>>>>>> You can see it includes the allheaders.h header file, which is the main header
>>>>>>>>> file of Leptonica. So I think it is a must to build Leptonica first and
>>>>>>>>> link it when we build Tesseract. This is not a big problem if we can use
>>>>>>>>> the static library of Leptonica (I did and it worked nicely). I think it is
>>>>>>>>> not an issue to use its static library because both Tesseract and Leptonica
>>>>>>>>> are under the Apache licence.
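
The jint remark in this message, and the "(jint)somePointer" casts discussed in the reply quoted above it, can be illustrated in plain Java without any native code. This is a minimal sketch: the class and method names are hypothetical and the address is made up. It only shows that a 64-bit value pushed through a 32-bit int loses its upper half, which is why a native handle is better kept in a 64-bit type (jlong in JNI, long in Java).

```java
/** Illustrates why a native pointer must not be squeezed into a 32-bit int. */
final class NativeHandleDemo {

    /** Mimics "(jint) somePointer" followed by a cast back to a pointer. */
    static long throughInt(long pointer) {
        int handle = (int) pointer; // the upper 32 bits are discarded here
        return handle;              // widening back cannot restore them
    }

    /** Mimics keeping the pointer in a 64-bit jlong / Java long. */
    static long throughLong(long pointer) {
        long handle = pointer;
        return handle;
    }

    public static void main(String[] args) {
        long pointer = 0x00007F8A12345678L; // made-up 64-bit address
        System.out.println(Long.toHexString(throughInt(pointer)));  // prints 12345678
        System.out.println(Long.toHexString(throughLong(pointer))); // prints 7f8a12345678
    }
}
```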
>>>>>>>>> >>>>>>>>> I'm working on the maven implementation you have mentioned and will >>>>>>>>> get >>>>>>>>> back to you soon. >>>>>>>>> >>>>>>>>> Thanks >>>>>>>>> Dimuthu >>>>>>>>> >>>>>>>>> >>>>>>>>> [1] https://code.google.com/p/tesseract-ocr/wiki/Compiling >>>>>>>>> [2] >>>>>>>>> >>>>>>>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/src/api/tesseractmain.cpp >>>>>>>>> >>>>>>>>> >>>>>>>>> On Wed, Mar 5, 2014 at 1:15 AM, John Hewson <j...@jahewson.com> wrote: >>>>>>>>> >>>>>>>>>> Hi Dimuthu, >>>>>>>>>> >>>>>>>>>> 1,2,3: >>>>>>>>>> >>>>>>>>>> Feel free to write your own Tesseract binding or port the existing >>>>>>>> code as >>>>>>>>>> you see fit. >>>>>>>>>> The JNI binding should be minimal, only the methods you require need >>>>>>>> to be >>>>>>>>>> wrapped. >>>>>>>>>> Also, don't forget that some of the interop can be done in Java, for >>>>>>>>>> example if it is easier >>>>>>>>>> to convert a BufferedImage to a byte array in Java then do it there >>>>>>>>>> and >>>>>>>>>> pass the result >>>>>>>>>> to JNI rather than writing lots of JNI C++ to achieve the same >>>>>>>>>> result. >>>>>>>>>> >>>>>>>>>> Your GitHub repo looks like a good start, I can make comments there >>>>>>>>>> as >>>>>>>>>> things progress. >>>>>>>>>> >>>>>>>>>> Is it possible to build Tesseract without leptonica? I was under the >>>>>>>>>> impression that it was >>>>>>>>>> used for image i/o only, but I may be misinformed. >>>>>>>>>> >>>>>>>>>> 4: The native platform library should be built as part of the Maven >>>>>>>> build >>>>>>>>>> for the Tesseract >>>>>>>>>> wrapper which can be a separate project. The output can be a jar file >>>>>>>>>> which contains the >>>>>>>>>> native binaries. It should be possible for the jar to contain >>>>>>>>>> prebuilt >>>>>>>>>> binaries for all platforms >>>>>>>>>> but this is something we can worry about later. 
Right now the goal >>>>>>>> should >>>>>>>>>> be to build a jar >>>>>>>>>> containing just the current platform's native binary and any Java >>>>>>>> wrapper >>>>>>>>>> code. >>>>>>>>>> >>>>>>>>>> -- John >>>>>>>>>> >>>>>>>>>> On 3 Mar 2014, at 16:41, DImuthu Upeksha <dimuthu.upeks...@gmail.com> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Hi John, >>>>>>>>>>> >>>>>>>>>>> I tried to reuse that android jni wrapper for tesseract. Here is my >>>>>>>>>>> observation >>>>>>>>>>> >>>>>>>>>>> 1. This wrapper heavily depends on android image libraries. >>>>>>>>>>> (android/bitmap.h). Most of the wrapper methods [1] use this >>>>>>>>>>> library. >>>>>>>>>>> >>>>>>>>>>> 2. But I can understand underlying logic in each function. Basically >>>>>>>> what >>>>>>>>>>> it does is mapping between tesseract api functions [2] with java >>>>>>>> methods. >>>>>>>>>>> In between it does to some image <=> byte array like conversions by >>>>>>>> using >>>>>>>>>>> that bitmap libraries in Android >>>>>>>>>>> >>>>>>>>>>> 3. There are two ways. 1: We can port it's code to make compatible >>>>>>>> with >>>>>>>>>> our >>>>>>>>>>> environments(linux,windows and mac) which is really painful. Also it >>>>>>>> will >>>>>>>>>>> cause memory leaks. 2: We can use only it's function signatures and >>>>>>>>>>> implement using our codes >>>>>>>>>>> >>>>>>>>>>> I think 2nd solution is better because we need only few operations >>>>>>>>>>> to >>>>>>>> be >>>>>>>>>>> done using tesseract library. I have created a github repo [3] for >>>>>>>> this. >>>>>>>>>>> It's still not finished. I need to add some make files and build >>>>>>>> files to >>>>>>>>>>> make it run properly. And also I need to implement those wrapper >>>>>>>>>> functions >>>>>>>>>>> [3]. This may take some time. >>>>>>>>>>> >>>>>>>>>>> 4. 
Because we are calling native libraries we need different builds >>>>>>>>>>> of >>>>>>>>>>> tesseract and leptonica libraries for each platform (dll for >>>>>>>>>>> windows, >>>>>>>> so >>>>>>>>>>> for linux, dylib for mac). So we may need to build those libraries >>>>>>>>>>> at >>>>>>>> the >>>>>>>>>>> time we build pdfbox project. Or we can pre build those libraries >>>>>>>>>>> and >>>>>>>> add >>>>>>>>>>> them to the project as .dll, .so or .dylib format. What is the >>>>>>>> preferred >>>>>>>>>>> way? >>>>>>>>>>> >>>>>>>>>>> [1] >>>>>>>>>>> >>>>>>>>>> >>>>>>>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/tessbaseapi.cpp >>>>>>>>>>> [2] https://code.google.com/p/tesseract-ocr/wiki/APIExample >>>>>>>>>>> [3] https://github.com/DImuthuUpe/Tesseract-API >>>>>>>>>>> [4] >>>>>>>>>>> >>>>>>>>>> >>>>>>>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/tessbaseapi.cpp >>>>>>>>>>> >>>>>>>>>>> Thanks >>>>>>>>>>> Dimuthu >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Sat, Mar 1, 2014 at 11:39 PM, DImuthu Upeksha < >>>>>>>>>> dimuthu.upeks...@gmail.com >>>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> I updated necessary changes to the document [1] >>>>>>>>>>>> >>>>>>>>>>>> For last two days I had a deep look at this [2] jni wrapper for >>>>>>>>>> tessaract >>>>>>>>>>>> api. >>>>>>>>>>>> Unfortunately this has been designed for Android environment so I >>>>>>>> think >>>>>>>>>> we >>>>>>>>>>>> need to write our own make files to build this in to a dll(windows) >>>>>>>> or >>>>>>>>>>>> dylib(in mac). Currently it has Android.mk files [3]. I'm searching >>>>>>>> for >>>>>>>>>> a >>>>>>>>>>>> way to convert it to a make file that we can run on console. 
Please >>>>>>>>>> suggest >>>>>>>>>>>> if you have a better approach >>>>>>>>>>>> >>>>>>>>>>>> [1] >>>>>>>>>>>> >>>>>>>>>> >>>>>>>> https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf >>>>>>>>>>>> [2] >>>>>>>>>>>> >>>>>>>>>> >>>>>>>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/ >>>>>>>>>>>> [3] >>>>>>>>>>>> >>>>>>>>>> >>>>>>>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <j...@jahewson.com> >>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> This is a good start. However, there is no need for the Adder >>>>>>>>>> component, >>>>>>>>>>>>> "Extracted Text (OCR) can just feed back into the PDFBox "Text >>>>>>>>>> Extractor". >>>>>>>>>>>>> >>>>>>>>>>>>> Maybe show a "PDF" file feeding in to "Text Extractor, to make it >>>>>>>> clear >>>>>>>>>>>>> where the process starts. >>>>>>>>>>>>> >>>>>>>>>>>>> -- John >>>>>>>>>>>>> >>>>>>>>>>>>> On 26 Feb 2014, at 16:53, DImuthu Upeksha < >>>>>>>> dimuthu.upeks...@gmail.com> >>>>>>>>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Sorry for the mistake. I added it to my Dropbox [1]. >>>>>>>>>>>>>> >>>>>>>>>>>>>> [1] >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>> >>>>>>>> https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>> Dimuthu >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <j...@jahewson.com> >>>>>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> I should add that the OCR engine should be pluggable so >>>>>>>>>>>>>>> PDFToText >>>>>>>>>> might >>>>>>>>>>>>>>> use an interface, e.g. 
OCREngine and there will be a >>>>>>>>>> TesseractOCREngine >>>>>>>>>>>>>>> class somewhere which provides the required functionality and >>>>>>>> lives >>>>>>>>>> in >>>>>>>>>>>>> a >>>>>>>>>>>>>>> separate jar file. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> -- John >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On 25 Feb 2014, at 20:18, Dimuthu <dimuthu.upeks...@gmail.com> >>>>>>>>>> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> So do you need to embed those new functionalities into existing >>>>>>>>>>>>>>> PDFtoText algorithms or package them as a new sub >>>>>>>>>>>>>>> system(something >>>>>>>>>>>>> like an >>>>>>>>>>>>>>> API)? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> -----Original Message----- >>>>>>>>>>>>>>>> From: "John Hewson" <j...@jahewson.com> >>>>>>>>>>>>>>>> Sent: 26/02/2014 07:38 >>>>>>>>>>>>>>>> To: "dev@pdfbox.apache.org" <dev@pdfbox.apache.org> >>>>>>>>>>>>>>>> Subject: Re: [GSoC 2014]Optical Character Recognition project - >>>>>>>>>>>>>>> Introduction >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Yes, exactly. By location data I just mean (x,y) coordinates >>>>>>>>>>>>>>>> and >>>>>>>>>> page >>>>>>>>>>>>>>> rotation. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> There is another use case for OCR: some fonts embedded in PDFs >>>>>>>> have >>>>>>>>>>>>>>> corrupt encodings, which means the ACSII codes map to the wrong >>>>>>>>>>>>> glyphs. We >>>>>>>>>>>>>>> could OCR the glyphs to repair the encoding. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> -- John >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha < >>>>>>>>>>>>> dimuthu.upeks...@gmail.com> >>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Hi John, >>>>>>>>>>>>>>>>> Thanks for the explanation. >>>>>>>>>>>>>>>>> Let's say there is a pdf with both text in extractable format >>>>>>>> and >>>>>>>>>>>>> some >>>>>>>>>>>>>>>>> images with text(Scanned images). 
In that case first we >>>>>>>>>>>>>>>>> extract >>>>>>>>>> those >>>>>>>>>>>>>>>>> extractable content using PDFBox algorithms and rest is >>>>>>>> extracted >>>>>>>>>>>>> using >>>>>>>>>>>>>>>>> OCR. Finally we pack both results together and give output as >>>>>>>>>>>>>>> PDFToText. Am >>>>>>>>>>>>>>>>> I correct? What do you mean by "location data"? >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>>>> Dimuthu >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson < >>>>>>>> j...@jahewson.com> >>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> 1. What is called "glyphs" ? >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> http://en.wikipedia.org/wiki/Glyph >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> 2. What is the main requirement of this project? >>>>>>>>>>>>>>>>>>> As far as I understood, first we need to generate an image >>>>>>>>>>>>>>>>>>> of >>>>>>>>>>>>>>>>>>> malformed pdfs from >>>>>>>>>>>>>>>>>>> PDFBox and then we need to do processing using OCR for >>>>>>>>>>>>>>>>>>> further >>>>>>>>>>>>>>> accurate >>>>>>>>>>>>>>>>>>> results. But the problem is, why shouldn't we directly do >>>>>>>> OCR on >>>>>>>>>>>>>>> those >>>>>>>>>>>>>>>>>>> PDFs without getting output from PDFBox? Correct me if I'm >>>>>>>> wrong. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> PDFBox can generate images (PDFToImage) and can extract text >>>>>>>>>>>>>>> (PDFToText). >>>>>>>>>>>>>>>>>> The goal of >>>>>>>>>>>>>>>>>> this project is to enhance PDFToText so that it can use OCR >>>>>>>>>>>>>>>>>> to >>>>>>>>>>>>> extract >>>>>>>>>>>>>>>>>> text from areas of the >>>>>>>>>>>>>>>>>> document where the text is embedded as an image. Such PDF >>>>>>>>>>>>>>>>>> files >>>>>>>>>> are >>>>>>>>>>>>>>>>>> typically generated by >>>>>>>>>>>>>>>>>> scanners or fax machines. 
There is also another case where >>>>>>>>>>>>>>>>>> OCR >>>>>>>> is >>>>>>>>>>>>>>> useful: >>>>>>>>>>>>>>>>>> some fonts embedded >>>>>>>>>>>>>>>>>> in PDF files contain the wrong encoding, so when text is >>>>>>>> extracted >>>>>>>>>>>>> with >>>>>>>>>>>>>>>>>> PDFToText the result is >>>>>>>>>>>>>>>>>> nonsense but when drawn with PDFToImage we see the correct >>>>>>>>>> letters. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Instead of: >>>>>>>>>>>>>>>>>> PDF => Image => OCR => Text >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> We want to do: >>>>>>>>>>>>>>>>>> PDF => (Many images for words + location data => OCR) => Text >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> -- John >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha < >>>>>>>>>>>>>>>>>> dimuthu.upeks...@gmail.com >>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Ok fixed. This is what I did >>>>>>>>>>>>>>>>>>>> Right click on the new project ->Debug As-> Debug >>>>>>>> Configurations >>>>>>>>>>>>>>>>>> ->Source >>>>>>>>>>>>>>>>>>>> ->Add -> Project >>>>>>>>>>>>>>>>>>>> Then I selected PDFBox project. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>>>>>>> Dimuthu >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha < >>>>>>>>>>>>>>>>>>>> dimuthu.upeks...@gmail.com> wrote: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> I'm using eclipse. This is what I want. I created a new >>>>>>>>>>>>>>>>>>>>> Java >>>>>>>>>>>>>>>>>> application >>>>>>>>>>>>>>>>>>>>> project (say TestPDFBox) with a main class with following >>>>>>>> code. 
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> PDDocument document = new PDDocument();
>>>>>>>>>>>>>>>>>>>>> PDPage blankPage = new PDPage();
>>>>>>>>>>>>>>>>>>>>> document.addPage( blankPage );
>>>>>>>>>>>>>>>>>>>>> document.save("BlankPage.pdf");
>>>>>>>>>>>>>>>>>>>>> document.close();
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Then I need to add the jar files generated in the target folder of PDFBox
>>>>>>>>>>>>>>>>>>>>> to the build path of my new project (I built the PDFBox project from
>>>>>>>>>>>>>>>>>>>>> source). That is what I did. But let's say I need to check the
>>>>>>>>>>>>>>>>>>>>> functionality of the document.save("") method. I don't have a reference to
>>>>>>>>>>>>>>>>>>>>> its sources because I used the generated jars directly. As Tilman said, I
>>>>>>>>>>>>>>>>>>>>> built PDFBox from sources, but I don't know a proper way to use it in other
>>>>>>>>>>>>>>>>>>>>> projects other than adding those jar files to the build path.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <j...@jahewson.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Which IDE are you using? You should be able to run the PDFToText class
>>>>>>>>>>>>>>>>>>>>>> (in pdfbox-tools) using your IDE and pass a PDF file path as the command
>>>>>>>>>>>>>>>>>>>>>> line argument.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>>>>>>> Thanks for the reply.
Yes I checked out PDFBox code and >>>>>>>>>>>>> managed to >>>>>>>>>>>>>>>>>>>>>> build >>>>>>>>>>>>>>>>>>>>>>> code successfully. I looked at the classes you mentioned >>>>>>>> and >>>>>>>>>> I >>>>>>>>>>>>>>> got a >>>>>>>>>>>>>>>>>>>>>> rough >>>>>>>>>>>>>>>>>>>>>>> idea about how they are working. To check them I used >>>>>>>>>>>>>>>>>>>>>>> the >>>>>>>>>> jars >>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>>>> target >>>>>>>>>>>>>>>>>>>>>>> folder to my separate java project. I tried samples in >>>>>>>>>>>>>>>>>>>>>>> http://pdfbox.apache.org/cookbook/. I need to further >>>>>>>> look >>>>>>>>>>>>> into >>>>>>>>>>>>>>> code >>>>>>>>>>>>>>>>>>>>>>> specially how those processXXX() methods work in >>>>>>>>>>>>> PDFTextStripper >>>>>>>>>>>>>>>>>> class. >>>>>>>>>>>>>>>>>>>>>>> What I usually do is adding some berakpoints and >>>>>>>>>>>>>>>>>>>>>>> checking >>>>>>>>>> them >>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>> debug >>>>>>>>>>>>>>>>>>>>>>> windows. But using jars it's not possible. What is the >>>>>>>>>>>>>>>>>>>>>>> way >>>>>>>>>> you >>>>>>>>>>>>>>> follow >>>>>>>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>>>>> order to do such task? >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> As well I installed tesseract in to my machine and >>>>>>>> managed to >>>>>>>>>>>>> do >>>>>>>>>>>>>>> some >>>>>>>>>>>>>>>>>>>>>> OCR >>>>>>>>>>>>>>>>>>>>>>> stuff also. That's a cool tool which works fine. >>>>>>>>>>>>>>>>>>>>>>> I'm still learning the code. If I get any issue I'll >>>>>>>>>>>>>>>>>>>>>>> drop >>>>>>>>>> you a >>>>>>>>>>>>>>> mail. 
>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>>>>>>>>>> Dimuthu >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson < >>>>>>>>>>>>> j...@jahewson.com >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Hi Dimuthu >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> The PDFBox website can be found at >>>>>>>>>> http://pdfbox.apache.org/it >>>>>>>>>>>>>>>>>>>>>> contains >>>>>>>>>>>>>>>>>>>>>>>> a basic overview of the project >>>>>>>>>>>>>>>>>>>>>>>> and details on how to obtain the source code and build >>>>>>>>>> PDFBox >>>>>>>>>>>>> for >>>>>>>>>>>>>>>>>>>>>> yourself. >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912 >>>>>>>> details >>>>>>>>>>>>> the >>>>>>>>>>>>>>> only >>>>>>>>>>>>>>>>>>>>>>>> thoughts so far regarding it. >>>>>>>>>>>>>>>>>>>>>>>> Note that the OCR libraries mentioned in the JIRA issue >>>>>>>> are >>>>>>>>>>>>> all >>>>>>>>>>>>>>>>>> under >>>>>>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>>>>>>>> Apache license, which is a >>>>>>>>>>>>>>>>>>>>>>>> requirement. >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Once you have the source code, take a look at the >>>>>>>> PageDrawer >>>>>>>>>>>>>>> class >>>>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>>>>>>> see >>>>>>>>>>>>>>>>>>>>>>>> how text and images are >>>>>>>>>>>>>>>>>>>>>>>> rendered. We want someone to interface at a low-level >>>>>>>> (e.g. >>>>>>>>>>>>> one >>>>>>>>>>>>>>>>>> glyph, >>>>>>>>>>>>>>>>>>>>>>>> word, or sentence at a time) with >>>>>>>>>>>>>>>>>>>>>>>> an OCR engine. 
Also look at PDFTextStripper which is >>>>>>>>>>>>>>>>>>>>>>>> how >>>>>>>>>> text >>>>>>>>>>>>> is >>>>>>>>>>>>>>>>>>>>>> currently >>>>>>>>>>>>>>>>>>>>>>>> extracted, take a look at how >>>>>>>>>>>>>>>>>>>>>>>> we have to go to great length to sort text back into >>>>>>>> reading >>>>>>>>>>>>>>> order >>>>>>>>>>>>>>>>>> and >>>>>>>>>>>>>>>>>>>>>>>> infer the placement of diacritics - PDF >>>>>>>>>>>>>>>>>>>>>>>> is fundamentally a visual format, not a structured >>>>>>>>>>>>>>>>>>>>>>>> format >>>>>>>>>> like >>>>>>>>>>>>>>> HTML >>>>>>>>>>>>>>>>>> - >>>>>>>>>>>>>>>>>>>>>>>> which is why extracting text can be so >>>>>>>>>>>>>>>>>>>>>>>> difficult sometimes. >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> The full PDF Reference document can be found at: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>> >>>>>>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Feel free to discuss specifics of your proposal or ask >>>>>>>> any >>>>>>>>>>>>>>>>>> questions. >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> -- John >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha < >>>>>>>>>>>>>>>>>> dimuthu.upeks...@gmail.com >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering >>>>>>>> Undergraduate >>>>>>>>>> at >>>>>>>>>>>>>>>>>>>>>> University >>>>>>>>>>>>>>>>>>>>>>>> of Moratuwa Sri Lanka. I successfully completed my GSoC >>>>>>>> 2013 >>>>>>>>>>>>> with >>>>>>>>>>>>>>>>>>>>>> Apache >>>>>>>>>>>>>>>>>>>>>>>> ISIS [1] project. I'm very much interested in OCR and >>>>>>>> image >>>>>>>>>>>>>>>>>> processing >>>>>>>>>>>>>>>>>>>>>>>> stuff. 
So I would like to select this project idea as >>>>>>>>>>>>>>>>>>>>>>>> my >>>>>>>>>> GSoC >>>>>>>>>>>>>>> 2014 >>>>>>>>>>>>>>>>>>>>>> project >>>>>>>>>>>>>>>>>>>>>>>> because I feel like it is the best suited project for >>>>>>>> me. In >>>>>>>>>>>>>>>>>>>>>> university >>>>>>>>>>>>>>>>>>>>>>>> also we have done some research in OCR area and our >>>>>>>>>>>>>>>>>>>>>>>> group >>>>>>>>>>>>> wrote a >>>>>>>>>>>>>>>>>>>>>>>> literature review about increasing efficiency of OCR >>>>>>>>>>>>>>>>>>>>>> systems(attached). Can >>>>>>>>>>>>>>>>>>>>>>>> you please suggest me where to start learning about >>>>>>>> PDFBox? >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> [1] >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>> >>>>>>>> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29 >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> Thank you >>>>>>>>>>>>>>>>>>>>>>>>> Dimuthu >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>>>>> Regards >>>>>>>>>>>>>>>>>>>>>>>>> W.Dimuthu Upeksha >>>>>>>>>>>>>>>>>>>>>>>>> Undergraduate >>>>>>>>>>>>>>>>>>>>>>>>> Department of Computer Science And Engineering >>>>>>>>>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>>> Regards >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> W.Dimuthu Upeksha >>>>>>>>>>>>>>>>>>>>>>> Undergraduate >>>>>>>>>>>>>>>>>>>>>>> Department of Computer Science And Engineering >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>> Regards >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> W.Dimuthu Upeksha >>>>>>>>>>>>>>>>>>>>> Undergraduate >>>>>>>>>>>>>>>>>>>>> Department of Computer Science And Engineering 
>>>>>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka

--
Regards
W.Dimuthu Upeksha
Undergraduate
Department of Computer Science And Engineering
University of Moratuwa, Sri Lanka
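
John's suggestion earlier in the thread that the OCR engine be pluggable (an OCREngine interface, with a TesseractOCREngine living in a separate jar) could look roughly like the sketch below. Apart from the name OCREngine, everything here is an assumption made for illustration: the recognize method, the OcrWord type and the stub engine are hypothetical and are not taken from the OCR-Plugin repository. The stub returns fixed words so the example runs without Tesseract.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

/** Sketch of a pluggable OCR contract; all nested names are hypothetical. */
final class OcrPluginSketch {

    /** One recognised word and its bounding box in image pixels. */
    static final class OcrWord {
        final String text;
        final Rectangle bounds;

        OcrWord(String text, Rectangle bounds) {
            this.text = text;
            this.bounds = bounds;
        }
    }

    /** The contract the text extractor would depend on. */
    interface OCREngine {
        List<OcrWord> recognize(BufferedImage image);
    }

    /** Stand-in engine with fixed output; a real one would delegate to Tesseract. */
    static final class StubOCREngine implements OCREngine {
        @Override
        public List<OcrWord> recognize(BufferedImage image) {
            List<OcrWord> words = new ArrayList<>();
            words.add(new OcrWord("Hello", new Rectangle(10, 10, 50, 12)));
            words.add(new OcrWord("PDFBox", new Rectangle(70, 10, 64, 12)));
            return words;
        }
    }

    /** Uses only the interface, so the engine implementation can be swapped. */
    static String extractText(OCREngine engine, BufferedImage image) {
        StringBuilder text = new StringBuilder();
        for (OcrWord word : engine.recognize(image)) {
            if (text.length() > 0) {
                text.append(' ');
            }
            text.append(word.text);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        BufferedImage page = new BufferedImage(200, 40, BufferedImage.TYPE_BYTE_GRAY);
        System.out.println(extractText(new StubOCREngine(), page)); // prints Hello PDFBox
    }
}
```

Because the caller sees only the interface, the Tesseract-backed implementation and its native binaries can stay in their own jar, as proposed in the thread.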