Re: [GSoC 2014]Optical Character Recognition project - Introduction

John Hewson Wed, 26 Feb 2014 15:14:43 -0800

I should add that the OCR engine should be pluggable: PDFToText might use an 
interface, e.g. OCREngine, and a TesseractOCREngine class living in a separate 
jar file would provide the required functionality.
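
A minimal sketch of what such a pluggable design could look like, in Java. Only
the names OCREngine and TesseractOCREngine come from the message above; the
recognize() signature, the FixedTextOCREngine stand-in, and the OcrTextAssembler
caller are assumptions made for illustration and are not existing PDFBox API.

```java
import java.awt.image.BufferedImage;
import java.util.List;

/** Plug-in point named in the thread; the method signature is an assumption. */
interface OCREngine {
    /** Returns the text recognised in the given image region. */
    String recognize(BufferedImage region);
}

/**
 * Stand-in engine that always returns the same text, used here so the sketch
 * runs without Tesseract. A real TesseractOCREngine would implement the same
 * interface and ship in its own jar.
 */
class FixedTextOCREngine implements OCREngine {
    private final String text;

    FixedTextOCREngine(String text) {
        this.text = text;
    }

    @Override
    public String recognize(BufferedImage region) {
        return text;
    }
}

/** Hypothetical caller: depends only on the interface, never on a concrete engine. */
class OcrTextAssembler {
    private final OCREngine engine;

    OcrTextAssembler(OCREngine engine) {
        this.engine = engine;
    }

    /** Runs OCR on each image region and joins the results with single spaces. */
    String extract(List<BufferedImage> regions) {
        StringBuilder out = new StringBuilder();
        for (BufferedImage region : regions) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(engine.recognize(region));
        }
        return out.toString();
    }
}
```

Because the caller only sees the interface, swapping engines would mean passing
a different OCREngine implementation to the constructor, with no change to the
text extraction code.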


-- John

> On 25 Feb 2014, at 20:18, Dimuthu <[email protected]> wrote:
> 
> So do you need to embed the new functionality into the existing PDFToText 
> algorithms, or package it as a new subsystem (something like an API)? 
> 
> -----Original Message-----
> From: "John Hewson" <[email protected]>
> Sent: 26/02/2014 07:38
> To: "[email protected]" <[email protected]>
> Subject: Re: [GSoC 2014]Optical Character Recognition project - Introduction
> 
> Yes, exactly. By location data I just mean (x,y) coordinates and page 
> rotation.
> 
> There is another use case for OCR: some fonts embedded in PDFs have corrupt 
> encodings, which means the ASCII codes map to the wrong glyphs. We could OCR 
> the glyphs to repair the encoding.
> 
> -- John
> 
>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <[email protected]> wrote:
>> 
>> Hi John,
>> Thanks for the explanation.
>> Let's say there is a PDF with both text in an extractable format and some
>> images containing text (scanned images). In that case we first extract the
>> extractable content using the PDFBox algorithms, and the rest is extracted
>> using OCR. Finally we combine both results and return them as the PDFToText
>> output. Am I correct? What do you mean by "location data"?
>> 
>> Thanks
>> Dimuthu
>> 
>> 
>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <[email protected]> wrote:
>>> 
>>>> 1. What are "glyphs"?
>>> 
>>> http://en.wikipedia.org/wiki/Glyph
>>> 
>>>> 2. What is the main requirement of this project?
>>>> As far as I understand, we first need to generate an image of malformed
>>>> PDFs with PDFBox and then process it with OCR to get more accurate
>>>> results. But why shouldn't we run OCR directly on those PDFs without
>>>> getting output from PDFBox? Correct me if I'm wrong.
>>> 
>>> PDFBox can generate images (PDFToImage) and can extract text (PDFToText).
>>> The goal of this project is to enhance PDFToText so that it can use OCR to
>>> extract text from areas of the document where the text is embedded as an
>>> image. Such PDF files are typically generated by scanners or fax machines.
>>> There is also another case where OCR is useful: some fonts embedded in PDF
>>> files contain the wrong encoding, so when text is extracted with PDFToText
>>> the result is nonsense, but when drawn with PDFToImage we see the correct
>>> letters.
>>> 
>>> Instead of:
>>> PDF => Image => OCR => Text
>>> 
>>> We want to do:
>>> PDF => (Many images for words + location data => OCR) => Text
>>> 
>>> -- John
>>> 
>>>> 
>>>> 
>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <[email protected]> wrote:
>>>> 
>>>>> OK, fixed. This is what I did:
>>>>> Right-click on the new project -> Debug As -> Debug Configurations ->
>>>>> Source -> Add -> Project
>>>>> Then I selected the PDFBox project.
>>>>> 
>>>>> Thanks
>>>>> Dimuthu
>>>>> 
>>>>> 
>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <[email protected]> wrote:
>>>>> 
>>>>>> I'm using Eclipse. This is what I want. I created a new Java application
>>>>>> project (say TestPDFBox) with a main class containing the following code.
>>>>>> 
>>>>>> PDDocument document = new PDDocument();
>>>>>> PDPage blankPage = new PDPage();
>>>>>> document.addPage(blankPage);
>>>>>> document.save("BlankPage.pdf");
>>>>>> document.close();
>>>>>> 
>>>>>> Then I need to add the jar files generated in the target folder of PDFBox
>>>>>> to the build path of my new project (I did build the PDFBox project from
>>>>>> source). That is what I did. But let's say I need to check the
>>>>>> functionality of the document.save("") method: I don't have a reference
>>>>>> to its sources because I used the generated jars directly. As Tilman
>>>>>> said, I built PDFBox from source, but I don't know a proper way to use it
>>>>>> in other projects other than adding those jar files to the build path.
>>>>>> 
>>>>>> 
>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <[email protected]> wrote:
>>>>>> 
>>>>>>> Which IDE are you using? You should be able to run the PDFToText class
>>>>>>> (in pdfbox-tools) using your IDE and pass a PDF file path as the command
>>>>>>> line argument.
>>>>>>> 
>>>>>>> -- John
>>>>>>> 
>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <[email protected]> wrote:
>>>>>>>> 
>>>>>>>> Hi John,
>>>>>>>> Thanks for the reply. Yes, I checked out the PDFBox code and managed to
>>>>>>>> build it successfully. I looked at the classes you mentioned and got a
>>>>>>>> rough idea of how they work. To check them I added the jars in the
>>>>>>>> target folder to a separate Java project and tried the samples at
>>>>>>>> http://pdfbox.apache.org/cookbook/. I need to look further into the
>>>>>>>> code, especially how the processXXX() methods in the PDFTextStripper
>>>>>>>> class work. What I usually do is add some breakpoints and inspect them
>>>>>>>> in the debug windows, but with jars that's not possible. How do you go
>>>>>>>> about such a task?
>>>>>>>> 
>>>>>>>> I also installed Tesseract on my machine and managed to do some OCR
>>>>>>>> with it. It's a cool tool that works well.
>>>>>>>> I'm still learning the code. If I run into any issues I'll drop you a
>>>>>>>> mail.
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> Dimuthu
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>> Hi Dimuthu
>>>>>>>>> 
>>>>>>>>> The PDFBox website is at http://pdfbox.apache.org/ and contains a basic
>>>>>>>>> overview of the project, with details on how to obtain the source code
>>>>>>>>> and build PDFBox for yourself.
>>>>>>>>> 
>>>>>>>>> Currently we do not perform any OCR, and PDFBOX-1912 details the only
>>>>>>>>> thoughts so far regarding it. Note that the OCR libraries mentioned in
>>>>>>>>> the JIRA issue are all under the Apache license, which is a requirement.
>>>>>>>>> 
>>>>>>>>> Once you have the source code, take a look at the PageDrawer class to see
>>>>>>>>> how text and images are rendered. We want someone to interface at a low
>>>>>>>>> level (e.g. one glyph, word, or sentence at a time) with an OCR engine.
>>>>>>>>> Also look at PDFTextStripper, which is how text is currently extracted.
>>>>>>>>> Note how we have to go to great lengths to sort text back into reading
>>>>>>>>> order and infer the placement of diacritics. PDF is fundamentally a
>>>>>>>>> visual format, not a structured format like HTML, which is why
>>>>>>>>> extracting text can be so difficult sometimes.
>>>>>>>>> 
>>>>>>>>> The full PDF Reference document can be found at:
>>>>>>>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>>>>>>>> 
>>>>>>>>> Feel free to discuss specifics of your proposal or ask any questions.
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> 
>>>>>>>>> -- John
>>>>>>>>> 
>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi,
>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering undergraduate at the
>>>>>>>>>> University of Moratuwa, Sri Lanka. I successfully completed GSoC 2013
>>>>>>>>>> with the Apache ISIS [1] project. I'm very interested in OCR and image
>>>>>>>>>> processing, so I would like to select this project idea as my GSoC 2014
>>>>>>>>>> project because I feel it is the best-suited project for me. At
>>>>>>>>>> university we have also done some research in the OCR area, and our
>>>>>>>>>> group wrote a literature review about increasing the efficiency of OCR
>>>>>>>>>> systems (attached). Can you please suggest where to start learning
>>>>>>>>>> about PDFBox?
>>>>>>>>>> 
>>>>>>>>>> [1]
>>>>>>>>>> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>>>>>>>>>> 
>>>>>>>>>> Thank you
>>>>>>>>>> Dimuthu
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> Regards
>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>> Undergraduate
>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>> University of Moratuwa, Sri Lanka
