I should add that the OCR engine should be pluggable, so PDFToText might use an interface, e.g. OCREngine, with a TesseractOCREngine class somewhere that provides the required functionality and lives in a separate jar file.
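
To make the pluggable idea concrete, here is a minimal sketch. Only the names OCREngine and TesseractOCREngine come from this thread; the recognize method, the ImageRegion and LocatedText types, the FixedTextOCREngine stub and the OcrSketch helper are assumptions made up for illustration, not existing PDFBox API. The stub stands in for a real engine so the example runs without Tesseract installed.

```java
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical plug-in point: PDFToText would depend only on this interface.
interface OCREngine {
    // Returns the text recognised in one small image (a glyph, word, or line).
    String recognize(BufferedImage image);
}

// Stand-in engine that always returns the same text. A real
// TesseractOCREngine would implement the same interface in its own jar.
class FixedTextOCREngine implements OCREngine {
    private final String text;

    FixedTextOCREngine(String text) {
        this.text = text;
    }

    @Override
    public String recognize(BufferedImage image) {
        return text;
    }
}

// Hypothetical holder for an image region plus its location data:
// (x, y) coordinates and page rotation.
class ImageRegion {
    final BufferedImage image;
    final float x;
    final float y;
    final int rotation;

    ImageRegion(BufferedImage image, float x, float y, int rotation) {
        this.image = image;
        this.x = x;
        this.y = y;
        this.rotation = rotation;
    }
}

// Recognised text together with the location it came from.
class LocatedText {
    final String text;
    final float x;
    final float y;
    final int rotation;

    LocatedText(String text, float x, float y, int rotation) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.rotation = rotation;
    }
}

class OcrSketch {
    // Runs the engine over each region and keeps the location with the result.
    static List<LocatedText> recognizeAll(OCREngine engine, List<ImageRegion> regions) {
        List<LocatedText> results = new ArrayList<>();
        for (ImageRegion r : regions) {
            results.add(new LocatedText(engine.recognize(r.image), r.x, r.y, r.rotation));
        }
        return results;
    }
}
```

Because the caller only sees OCREngine, swapping the stub for a Tesseract-backed class would not require any change to recognizeAll.
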
-- John

> On 25 Feb 2014, at 20:18, Dimuthu <[email protected]> wrote:
>
> So do you need to embed those new functionalities into the existing PDFToText
> algorithms or package them as a new subsystem (something like an API)?
>
> -----Original Message-----
> From: "John Hewson" <[email protected]>
> Sent: 26/02/2014 07:38
> To: "[email protected]" <[email protected]>
> Subject: Re: [GSoC 2014]Optical Character Recognition project - Introduction
>
> Yes, exactly. By location data I just mean (x,y) coordinates and page
> rotation.
>
> There is another use case for OCR: some fonts embedded in PDFs have corrupt
> encodings, which means the ASCII codes map to the wrong glyphs. We could OCR
> the glyphs to repair the encoding.
>
> -- John
>
>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <[email protected]> wrote:
>>
>> Hi John,
>> Thanks for the explanation.
>> Let's say there is a PDF with both text in extractable format and some
>> images with text (scanned images). In that case we first extract the
>> extractable content using PDFBox algorithms and the rest is extracted using
>> OCR. Finally we pack both results together and give the output as PDFToText.
>> Am I correct? What do you mean by "location data"?
>>
>> Thanks
>> Dimuthu
>>
>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <[email protected]> wrote:
>>>
>>>> 1. What is called "glyphs" ?
>>>
>>> http://en.wikipedia.org/wiki/Glyph
>>>
>>>> 2. What is the main requirement of this project?
>>>> As far as I understood, first we need to generate an image of
>>>> malformed PDFs from PDFBox and then we need to do processing using OCR
>>>> for more accurate results. But the problem is, why shouldn't we directly
>>>> do OCR on those PDFs without getting output from PDFBox? Correct me if
>>>> I'm wrong.
>>>
>>> PDFBox can generate images (PDFToImage) and can extract text (PDFToText).
>>> The goal of this project is to enhance PDFToText so that it can use OCR
>>> to extract text from areas of the document where the text is embedded as
>>> an image. Such PDF files are typically generated by scanners or fax
>>> machines. There is also another case where OCR is useful: some fonts
>>> embedded in PDF files contain the wrong encoding, so when text is
>>> extracted with PDFToText the result is nonsense but when drawn with
>>> PDFToImage we see the correct letters.
>>>
>>> Instead of:
>>> PDF => Image => OCR => Text
>>>
>>> We want to do:
>>> PDF => (Many images for words + location data => OCR) => Text
>>>
>>> -- John
>>>
>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <[email protected]> wrote:
>>>>
>>>>> Ok fixed. This is what I did:
>>>>> Right click on the new project -> Debug As -> Debug Configurations
>>>>> -> Source -> Add -> Project
>>>>> Then I selected the PDFBox project.
>>>>>
>>>>> Thanks
>>>>> Dimuthu
>>>>>
>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <[email protected]> wrote:
>>>>>
>>>>>> I'm using Eclipse. This is what I want. I created a new Java
>>>>>> application project (say TestPDFBox) with a main class with the
>>>>>> following code.
>>>>>>
>>>>>> PDDocument document = new PDDocument();
>>>>>> PDPage blankPage = new PDPage();
>>>>>> document.addPage(blankPage);
>>>>>> document.save("BlankPage.pdf");
>>>>>> document.close();
>>>>>>
>>>>>> Then I need to add the jar files generated in the target folder of
>>>>>> PDFBox to the build path of my new project (I did build the PDFBox
>>>>>> project from source). That is what I did. But let's say I need to
>>>>>> check the functionality of the document.save("") method. I don't have
>>>>>> a reference to its sources because I directly used the generated jars.
>>>>>> As Tilman said, I built PDFBox from sources, but I don't know a proper
>>>>>> way to use it in other projects other than adding those jar files to
>>>>>> the build path.
>>>>>>
>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <[email protected]> wrote:
>>>>>>
>>>>>>> Which IDE are you using? You should be able to run the PDFToText class
>>>>>>> (in pdfbox-tools) using your IDE and pass a PDF file path as the
>>>>>>> command line argument.
>>>>>>>
>>>>>>> -- John
>>>>>>>
>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Hi John,
>>>>>>>> Thanks for the reply. Yes, I checked out the PDFBox code and managed
>>>>>>>> to build it successfully. I looked at the classes you mentioned and
>>>>>>>> I got a rough idea about how they are working. To check them I added
>>>>>>>> the jars in the target folder to my separate Java project. I tried
>>>>>>>> the samples in http://pdfbox.apache.org/cookbook/. I need to look
>>>>>>>> further into the code, especially how those processXXX() methods work
>>>>>>>> in the PDFTextStripper class. What I usually do is add some
>>>>>>>> breakpoints and check them in the debug windows. But using jars it's
>>>>>>>> not possible. What is the way you follow in order to do such a task?
>>>>>>>>
>>>>>>>> I also installed Tesseract on my machine and managed to do some OCR
>>>>>>>> stuff. That's a cool tool which works fine.
>>>>>>>> I'm still learning the code. If I get any issue I'll drop you a mail.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Dimuthu
>>>>>>>>
>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Hi Dimuthu
>>>>>>>>>
>>>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/ it
>>>>>>>>> contains a basic overview of the project and details on how to
>>>>>>>>> obtain the source code and build PDFBox for yourself.
>>>>>>>>>
>>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912 details the only
>>>>>>>>> thoughts so far regarding it.
>>>>>>>>> Note that the OCR libraries mentioned in the JIRA issue are all
>>>>>>>>> under the Apache license, which is a requirement.
>>>>>>>>>
>>>>>>>>> Once you have the source code, take a look at the PageDrawer class
>>>>>>>>> to see how text and images are rendered. We want someone to
>>>>>>>>> interface at a low-level (e.g. one glyph, word, or sentence at a
>>>>>>>>> time) with an OCR engine. Also look at PDFTextStripper which is how
>>>>>>>>> text is currently extracted, take a look at how we have to go to
>>>>>>>>> great length to sort text back into reading order and infer the
>>>>>>>>> placement of diacritics - PDF is fundamentally a visual format, not
>>>>>>>>> a structured format like HTML - which is why extracting text can be
>>>>>>>>> so difficult sometimes.
>>>>>>>>>
>>>>>>>>> The full PDF Reference document can be found at:
>>>>>>>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>>>>>>>>
>>>>>>>>> Feel free to discuss specifics of your proposal or ask any
>>>>>>>>> questions.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> -- John
>>>>>>>>>
>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering Undergraduate at
>>>>>>>>>> University of Moratuwa Sri Lanka. I successfully completed my GSoC
>>>>>>>>>> 2013 with Apache ISIS [1] project. I'm very much interested in OCR
>>>>>>>>>> and image processing stuff. So I would like to select this project
>>>>>>>>>> idea as my GSoC 2014 project because I feel like it is the best
>>>>>>>>>> suited project for me. In university also we have done some
>>>>>>>>>> research in OCR area and our group wrote a literature review about
>>>>>>>>>> increasing efficiency of OCR systems (attached).
>>>>>>>>>> Can you please suggest me where to start learning about PDFBox?
>>>>>>>>>>
>>>>>>>>>> [1] http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>>>>>>>>>>
>>>>>>>>>> Thank you
>>>>>>>>>> Dimuthu
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Regards
>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>> Undergraduate
>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>> University of Moratuwa, Sri Lanka
