Re: [iText-questions] invalid strings when doing textextract.

Mark Storer Tue, 21 Sep 2010 10:08:07 -0700

There are a few open source OCR projects floating around out there.  

OCRopus is a C/C++ project hosted by Google: http://code.google.com/p/ocropus/


Tesseract-ocr is another C/C++ project hosted by Google: 
http://code.google.com/p/tesseract-ocr/

I'm noticing a pattern...

Apparently, the only pure-Java OCR libraries are either commercial or awful.  Fun choices.

--Mark Storer
  Senior Software Engineer
  Cardiff.com
 
import legalese.Disclaimer;
Disclaimer<Cardiff> DisCard = null;
 
 

> -----Original Message-----
> From: 1T3XT info [mailto:[email protected]] 
> Sent: Tuesday, September 21, 2010 4:28 AM
> To: Post all your questions about iText here
> Cc: mp
> Subject: Re: [iText-questions] invalid strings when doing textextract.
> 
> On 20/09/2010 10:21, mp wrote:
> > I attach a new pdf. I use your code:
> > I get the same error.
> 
> OK, the err.pdf you've sent is completely different from the 
> initial document. We had a pointless discussion about the 
> same subject on the list a couple of weeks ago.
> 
> Somebody said: iText can't parse this file.
> We replied: no tool can parse this file.
> Then the OP got angry thinking we didn't want to help him, 
> although we tried to explain with hands and feet what had happened.
> 
> If you read chapter 2 of the book (I hope you took the time 
> to do that before starting an adventure with iText, just as 
> you take driving lessons before you get behind the 
> wheel of a car), you'll read:
> Characters in a file are rendered on screen or on paper as glyphs. 
> ISO-32000-1, section 9.2.1, states: "A character is an 
> abstract symbol, whereas a glyph is a specific graphical 
> rendering of a character. For example: The glyphs A, /A/, 
> and *A* are renderings of the abstract 'A' character. 
> Glyphs are organized into fonts. A font defines 
> glyphs for a particular character set."
> 
> So a glyph on a page is not the same as a character.
> 
> Now let's skip to chapter 11:
> "Glyphs in a simple font are selected using a single byte. 
> Each glyph corresponds to a character that has a value from 0 
> to 255. The mapping between the characters and the glyphs is 
> called the character encoding."
> 
> If you have a language with more than 256 different glyphs, 
> and you want to use a font as a "simple font", it goes 
> without saying that you'll need a special encoding. The 
> character A won't necessarily be mapped to a glyph that looks like an A.
> 
> This is explained in chapter 15:
> "It's possible for a PDF to have a font containing characters 
> that appear in a content stream as a, b, c, and so on, but 
> for which the shapes drawn in the PDF file show a completely 
> different glyph, such as α, β, γ, and so on. An application 
> can create a different encoding for each specific PDF 
> document (for example, in an attempt to obfuscate). More 
> likely, the PDF-generating software does this deliberately, 
> such as when a font with many characters is used but all the 
> text can be shown using only 256 different glyphs. In this 
> case, the software picks character names at random according 
> to the glyphs that are used."
> 
> Now if you use the example in attachment, you'll get the 
> following result when parsing err.pdf:
> 
> <<1 ><1 ><2 ><3 ><4 5 ><6 ><2 ><7 ><5 ><8 8 9 ><6 ><a ><5 ><7 ><b ><7
>  ><5 ><8 8 9 ><6 ><1 ><2 ><3 ><4 ><5 ><6 ><2 ><7 ><8 ><5 ><4 
> ><9 ><4 ><5
>  ><2 ><2 ><a ><2 ><2 ><1 ><8 ><b ><2 ><8 ><a ><b ><c ><4 d 
> ><3 ><6 ><e  ><f 4 d ><b ><2 ><4 5 ><8 >> and so on...
> 
> What do you see?
> The software that created your PDF used the (char) 1 for the 
> first glyph that was added, (char) 2 for the second glyph, 
> (char) 3 for the third, and so on...
> 
> There is no way for iText to know what the glyph corresponding to
> (char) 1 looks like. iText can find the paths that 
> were used to draw the glyph (two concentric circles could be 
> an O, two circles on top of each other could be an 8, and so on).
> 
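
To make the numbering scheme described above concrete, here is a minimal Java sketch. It is not iText code and not what the producer of err.pdf actually runs; the class and method names are hypothetical, and printing the codes in hex merely imitates the dump quoted above. It hands out code 1 to the first distinct character, 2 to the next, and so on, then shows what is left for an extractor when no code-to-character mapping is stored with the font.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: none of these names come from iText.
class SequentialGlyphCodes {

    /** Assigns code 1 to the first distinct character seen, 2 to the next, and so on. */
    static List<Integer> encode(String text, Map<Character, Integer> codes) {
        List<Integer> out = new ArrayList<>();
        for (char c : text.toCharArray()) {
            Integer code = codes.get(c);
            if (code == null) {
                // First time this character appears: give it the next free code.
                code = codes.size() + 1;
                codes.put(c, code);
            }
            out.add(code);
        }
        return out;
    }

    /** What an extractor is left with when the file stores no mapping: bare codes. */
    static String extractWithoutMapping(List<Integer> codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            // Hex formatting is an assumption, chosen to resemble the dump above.
            sb.append('<').append(Integer.toHexString(code)).append(" >");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<Character, Integer> table = new LinkedHashMap<>();
        List<Integer> encoded = encode("hello", table);
        System.out.println(encoded);
        System.out.println(extractWithoutMapping(encoded));
    }
}
```

The table built inside encode() exists only in the producing software. If it is not written into the PDF, the codes cannot be turned back into characters without looking at the glyph shapes.
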
> But iText doesn't do OCR, nor does any other F/OSS project.
> iText makes a good effort to parse PDF documents, and if you 
> take the time to get your driver's license, I mean: if you 
> take the time to read the book before asking questions, you'll 
> fully understand that some PDF files just can't be parsed.
> 
