I have made the necessary changes to the document [1]. For the last two days I have been looking closely at this JNI wrapper for the Tesseract API [2]. Unfortunately, it was designed for the Android environment, so I think we need to write our own makefiles to build it into a DLL (Windows) or a dylib (Mac). Currently it has Android.mk files [3]. I'm looking for a way to convert them into a makefile that we can run from the console. Please suggest a better approach if you have one.
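Whatever makefile we end up with, the Java side would load the wrapper by a platform-neutral name, so the build has to produce a differently named file on each OS. Here is a minimal sketch of that mapping, assuming a hypothetical library base name `tessjni` (it is not taken from tesseract-android-tools) and the usual desktop naming conventions:

```java
import java.util.Locale;

public class NativeLibNames {
    // Hypothetical base name for the JNI wrapper; the real one depends on our makefile.
    static final String BASE_NAME = "tessjni";

    /** Returns the file name the build should produce for a given os.name value. */
    static String libraryFileName(String osName, String baseName) {
        String os = osName.toLowerCase(Locale.ROOT);
        if (os.contains("mac")) {
            return "lib" + baseName + ".dylib";
        }
        if (os.contains("win")) {
            return baseName + ".dll";
        }
        // Assumption: everything else follows the ELF convention used on Linux.
        return "lib" + baseName + ".so";
    }

    public static void main(String[] args) {
        String osName = System.getProperty("os.name");
        System.out.println("Build target for this machine: "
                + libraryFileName(osName, BASE_NAME));
        // System.mapLibraryName reports the name the running JVM will actually look for.
        System.out.println("JVM expects: " + System.mapLibraryName(BASE_NAME));
    }
}
```

Once a file with that name is on java.library.path, `System.loadLibrary("tessjni")` should find it without any OS-specific Java code.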
[1] https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf
[2] https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/
[3] https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk

On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <[email protected]> wrote:

> This is a good start. However, there is no need for the Adder component,
> "Extracted Text (OCR)" can just feed back into the PDFBox "Text Extractor".
>
> Maybe show a "PDF" file feeding in to "Text Extractor", to make it clear
> where the process starts.
>
> -- John
>
> On 26 Feb 2014, at 16:53, DImuthu Upeksha <[email protected]>
> wrote:
>
> > Sorry for the mistake. I added it to my Dropbox [1].
> >
> > [1]
> > https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
> >
> > Thanks
> > Dimuthu
> >
> >
> > On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <[email protected]> wrote:
> >
> >> I should add that the OCR engine should be pluggable so PDFToText might
> >> use an interface, e.g. OCREngine and there will be a TesseractOCREngine
> >> class somewhere which provides the required functionality and lives in a
> >> separate jar file.
> >>
> >> -- John
> >>
> >>> On 25 Feb 2014, at 20:18, Dimuthu <[email protected]> wrote:
> >>>
> >>> So do you need to embed those new functionalities into existing
> >>> PDFtoText algorithms or package them as a new sub system(something like
> >>> an API)?
> >>>
> >>> -----Original Message-----
> >>> From: "John Hewson" <[email protected]>
> >>> Sent: 26/02/2014 07:38
> >>> To: "[email protected]" <[email protected]>
> >>> Subject: Re: [GSoC 2014]Optical Character Recognition project -
> >>> Introduction
> >>>
> >>> Yes, exactly. By location data I just mean (x,y) coordinates and page
> >>> rotation.
> >>>
> >>> There is another use case for OCR: some fonts embedded in PDFs have
> >>> corrupt encodings, which means the ASCII codes map to the wrong glyphs.
> >>> We could OCR the glyphs to repair the encoding.
> >>>
> >>> -- John
> >>>
> >>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <[email protected]>
> >>>> wrote:
> >>>>
> >>>> Hi John,
> >>>> Thanks for the explanation.
> >>>> Let's say there is a pdf with both text in extractable format and some
> >>>> images with text(Scanned images). In that case first we extract those
> >>>> extractable content using PDFBox algorithms and rest is extracted using
> >>>> OCR. Finally we pack both results together and give output as
> >>>> PDFToText. Am I correct? What do you mean by "location data"?
> >>>>
> >>>> Thanks
> >>>> Dimuthu
> >>>>
> >>>>
> >>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <[email protected]> wrote:
> >>>>>
> >>>>> 1. What is called "glyphs" ?
> >>>>>
> >>>>> http://en.wikipedia.org/wiki/Glyph
> >>>>>
> >>>>>> 2. What is the main requirement of this project?
> >>>>>> As far as I understood, first we need to generate an image of
> >>>>>> malformed pdfs from PDFBox and then we need to do processing using
> >>>>>> OCR for further accurate results. But the problem is, why shouldn't
> >>>>>> we directly do OCR on those PDFs without getting output from PDFBox?
> >>>>>> Correct me if I'm wrong.
> >>>>>
> >>>>> PDFBox can generate images (PDFToImage) and can extract text
> >>>>> (PDFToText). The goal of this project is to enhance PDFToText so that
> >>>>> it can use OCR to extract text from areas of the document where the
> >>>>> text is embedded as an image. Such PDF files are typically generated
> >>>>> by scanners or fax machines.
> >>>>> There is also another case where OCR is useful: some fonts embedded
> >>>>> in PDF files contain the wrong encoding, so when text is extracted
> >>>>> with PDFToText the result is nonsense but when drawn with PDFToImage
> >>>>> we see the correct letters.
> >>>>>
> >>>>> Instead of:
> >>>>> PDF => Image => OCR => Text
> >>>>>
> >>>>> We want to do:
> >>>>> PDF => (Many images for words + location data => OCR) => Text
> >>>>>
> >>>>> -- John
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <[email protected]>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Ok fixed. This is what I did
> >>>>>>> Right click on the new project -> Debug As -> Debug Configurations
> >>>>>>> -> Source -> Add -> Project
> >>>>>>> Then I selected PDFBox project.
> >>>>>>>
> >>>>>>> Thanks
> >>>>>>> Dimuthu
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <
> >>>>>>> [email protected]> wrote:
> >>>>>>>
> >>>>>>>> I'm using eclipse. This is what I want. I created a new Java
> >>>>>>>> application project (say TestPDFBox) with a main class with
> >>>>>>>> following code.
> >>>>>>>>
> >>>>>>>> PDDocument document = new PDDocument();
> >>>>>>>> PDPage blankPage = new PDPage();
> >>>>>>>> document.addPage( blankPage );
> >>>>>>>> document.save("BlankPage.pdf");
> >>>>>>>> document.close();
> >>>>>>>>
> >>>>>>>> Then I need to add those jar files generated in target folder of
> >>>>>>>> PDFBox to build path of my new project (I did build the PDFBox
> >>>>>>>> project from source). That is what I did. But let's say I need to
> >>>>>>>> check the functionality of document.save("") method. But I don't
> >>>>>>>> have a reference to its sources because I directly used generated
> >>>>>>>> jars. As Tilman said I built PDFBox from sources but I don't know a
> >>>>>>>> proper way to use it in other projects other than adding those jar
> >>>>>>>> files to build path.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> Which IDE are you using? You should be able to run the PDFToText
> >>>>>>>>> class (in pdfbox-tools) using your IDE and pass a PDF file path as
> >>>>>>>>> the command line argument.
> >>>>>>>>>
> >>>>>>>>> -- John
> >>>>>>>>>
> >>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <[email protected]>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi John,
> >>>>>>>>>> Thanks for the reply. Yes I checked out PDFBox code and managed to
> >>>>>>>>>> build code successfully. I looked at the classes you mentioned and
> >>>>>>>>>> I got a rough idea about how they are working. To check them I
> >>>>>>>>>> used the jars in target folder to my separate java project. I
> >>>>>>>>>> tried samples in http://pdfbox.apache.org/cookbook/. I need to
> >>>>>>>>>> further look into code specially how those processXXX() methods
> >>>>>>>>>> work in PDFTextStripper class. What I usually do is adding some
> >>>>>>>>>> breakpoints and checking them in debug windows. But using jars
> >>>>>>>>>> it's not possible. What is the way you follow in order to do such
> >>>>>>>>>> task?
> >>>>>>>>>>
> >>>>>>>>>> As well I installed tesseract in to my machine and managed to do
> >>>>>>>>>> some OCR stuff also. That's a cool tool which works fine.
> >>>>>>>>>> I'm still learning the code. If I get any issue I'll drop you a
> >>>>>>>>>> mail.
> >>>>>>>>>>
> >>>>>>>>>> Thanks
> >>>>>>>>>> Dimuthu
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <[email protected]> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi Dimuthu
> >>>>>>>>>>>
> >>>>>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/. It
> >>>>>>>>>>> contains a basic overview of the project and details on how to
> >>>>>>>>>>> obtain the source code and build PDFBox for yourself.
> >>>>>>>>>>>
> >>>>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912 details the
> >>>>>>>>>>> only thoughts so far regarding it. Note that the OCR libraries
> >>>>>>>>>>> mentioned in the JIRA issue are all under the Apache license,
> >>>>>>>>>>> which is a requirement.
> >>>>>>>>>>>
> >>>>>>>>>>> Once you have the source code, take a look at the PageDrawer
> >>>>>>>>>>> class to see how text and images are rendered. We want someone to
> >>>>>>>>>>> interface at a low-level (e.g. one glyph, word, or sentence at a
> >>>>>>>>>>> time) with an OCR engine. Also look at PDFTextStripper which is
> >>>>>>>>>>> how text is currently extracted, take a look at how we have to go
> >>>>>>>>>>> to great length to sort text back into reading order and infer
> >>>>>>>>>>> the placement of diacritics - PDF is fundamentally a visual
> >>>>>>>>>>> format, not a structured format like HTML - which is why
> >>>>>>>>>>> extracting text can be so difficult sometimes.
> >>>>>>>>>>>
> >>>>>>>>>>> The full PDF Reference document can be found at:
> >>>>>>>>>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
> >>>>>>>>>>>
> >>>>>>>>>>> Feel free to discuss specifics of your proposal or ask any
> >>>>>>>>>>> questions.
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>>
> >>>>>>>>>>> -- John
> >>>>>>>>>>>
> >>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <[email protected]>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi,
> >>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering Undergraduate at
> >>>>>>>>>>>> University of Moratuwa Sri Lanka. I successfully completed my
> >>>>>>>>>>>> GSoC 2013 with Apache ISIS [1] project. I'm very much interested
> >>>>>>>>>>>> in OCR and image processing stuff. So I would like to select
> >>>>>>>>>>>> this project idea as my GSoC 2014 project because I feel like it
> >>>>>>>>>>>> is the best suited project for me. In university also we have
> >>>>>>>>>>>> done some research in OCR area and our group wrote a literature
> >>>>>>>>>>>> review about increasing efficiency of OCR systems(attached). Can
> >>>>>>>>>>>> you please suggest me where to start learning about PDFBox?
> >>>>>>>>>>>>
> >>>>>>>>>>>> [1]
> >>>>>>>>>>>> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thank you
> >>>>>>>>>>>> Dimuthu
> >>>>>>>>>>>>
> >>>>>>>>>>>> --
> >>>>>>>>>>>> Regards
> >>>>>>>>>>>> W.Dimuthu Upeksha
> >>>>>>>>>>>> Undergraduate
> >>>>>>>>>>>> Department of Computer Science And Engineering
> >>>>>>>>>>>> University of Moratuwa, Sri Lanka

--
Regards
W.Dimuthu Upeksha
Undergraduate
Department of Computer Science And Engineering
University of Moratuwa, Sri Lanka
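
P.S. To check that I understood John's earlier points about a pluggable OCREngine and the "PDF => (Many images for words + location data => OCR) => Text" flow, here is a rough sketch. Only the names OCREngine and TesseractOCREngine come from the thread; the `recognize` signature, the `Region` class and the `extractText` helper are my own assumptions, and the regions are assumed to arrive already in reading order.

```java
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

public class OcrPipelineSketch {

    /** Pluggable engine; the method signature is an assumption, not an agreed API. */
    interface OCREngine {
        String recognize(BufferedImage image);
    }

    /** Hypothetical holder for one image region plus its location data. */
    static final class Region {
        final BufferedImage image;
        final float x;
        final float y;
        final int pageRotation;

        Region(BufferedImage image, float x, float y, int pageRotation) {
            this.image = image;
            this.x = x;
            this.y = y;
            this.pageRotation = pageRotation;
        }
    }

    /** Runs the engine over each region and joins the non-empty results with spaces. */
    static String extractText(List<Region> regions, OCREngine engine) {
        StringBuilder text = new StringBuilder();
        for (Region region : regions) {
            String word = engine.recognize(region.image);
            if (word == null || word.isEmpty()) {
                continue; // nothing recognised in this region
            }
            if (text.length() > 0) {
                text.append(' ');
            }
            text.append(word);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Stand-in engine for illustration only; a real TesseractOCREngine
        // would live in a separate jar and call the native wrapper.
        OCREngine fake = image -> "w" + image.getWidth();

        List<Region> regions = new ArrayList<>();
        regions.add(new Region(new BufferedImage(3, 1, BufferedImage.TYPE_INT_RGB), 10f, 20f, 0));
        regions.add(new Region(new BufferedImage(5, 1, BufferedImage.TYPE_INT_RGB), 40f, 20f, 0));

        System.out.println(extractText(regions, fake));
    }
}
```

With this shape, PDFToText would depend only on the interface, so the Tesseract-specific code and its native library could be swapped out without touching PDFBox itself.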
