There are a few open source OCR projects floating around out there. OCRopus is a C/C++ project hosted by Google: http://code.google.com/p/ocropus/
Tesseract-ocr is another C/C++ project hosted by Google: http://code.google.com/p/tesseract-ocr/

I'm noticing a pattern... Apparently, the only pure-Java OCR engines are commercial, or awful. Fun choices.

--Mark Storer
Senior Software Engineer
Cardiff.com

import legalese.Disclaimer;
Disclaimer<Cardiff> DisCard = null;

> -----Original Message-----
> From: 1T3XT info [mailto:[email protected]]
> Sent: Tuesday, September 21, 2010 4:28 AM
> To: Post all your questions about iText here
> Cc: mp
> Subject: Re: [iText-questions] invalid strings when doing textextract.
>
> On 20/09/2010 10:21, mp wrote:
> > I attach a new pdf. I use your code:
> > I get the same error.
>
> OK, the err.pdf you've sent is completely different from the
> initial document. We've had a pointless discussion about the
> same subject on the list a couple of weeks ago.
>
> Somebody said: iText can't parse this file.
> We replied: no tool can parse this file.
> Then the OP got angry thinking we didn't want to help him,
> although we tried to explain with hands and feet what had happened.
>
> If you read chapter 2 of the book (I hope you took the time
> to do that before starting an adventure with iText, just as
> you take driving lessons before you take your place behind the
> wheel of a car), you read:
> Characters in a file are rendered on screen or on paper as glyphs.
> ISO-32000-1, section 9.2.1, states: "A character is an
> abstract symbol, whereas a glyph is a specific graphical
> rendering of a character. For
> example: The glyphs A, /A/, and *A* are renderings of the
> abstract 'A'
> character. Glyphs are organized into fonts. A font defines
> glyphs for a particular character set."
>
> So a glyph on a page is not the same as a character.
>
> Now let's skip to chapter 11:
> "Glyphs in a simple font are selected using a single byte.
> Each glyph corresponds to a character that has a value from 0
> to 255. The mapping between the characters and the glyphs is
> called the character encoding."
>
> If you have a language with more than 256 different glyphs,
> and you want to use a font as a "simple font", it goes
> without saying that you'll need a special encoding. The
> character A won't necessarily be mapped to a glyph that looks like an A.
>
> This is explained in chapter 15:
> "It's possible for a PDF to have a font containing characters
> that appear in a content stream as a, b, c, and so on, but
> for which the shapes drawn in the PDF file show a completely
> different glyph, such as α, β, γ, and so on. An application
> can create a different encoding for each specific PDF
> document, for example in an attempt to obfuscate. More
> likely, the PDF-generating software does this deliberately,
> such as when a font with many characters is used but all the
> text can be shown using only 256 different glyphs. In this
> case, the software picks character names at random according
> to the glyphs that are used."
>
> Now if you use the example in attachment, you'll get the
> following result when parsing err.pdf:
>
> <<1 ><1 ><2 ><3 ><4 5 ><6 ><2 ><7 ><5 ><8 8 9 ><6 ><a ><5 ><7 ><b ><7
> ><5 ><8 8 9 ><6 ><1 ><2 ><3 ><4 ><5 ><6 ><2 ><7 ><8 ><5 ><4
> ><9 ><4 ><5
> ><2 ><2 ><a ><2 ><2 ><1 ><8 ><b ><2 ><8 ><a ><b ><c ><4 d
> ><3 ><6 ><e ><f 4 d ><b ><2 ><4 5 ><8 >> and so on...
>
> What do you see?
> The software that created your PDF used the (char) 1 for the
> first glyph that was added, (char) 2 for the second glyph,
> (char) 3 for the third, and so on...
>
> There is no way for iText to know what the glyph corresponding with
> (char) 1 looks like. I mean: iText can find the paths that
> were used to draw the glyph (two concentric circles could be
> an O, two circles on top of each other could be an 8), and so on...
>
> But iText doesn't do OCR, nor does any other F/OSS project.
> iText makes a good effort to parse PDF documents, and if you
> take the time to get your driver's license, I mean: if you
> take the time to read the book before asking questions, you'll
> fully understand that some PDF files just can't be parsed.

_______________________________________________
iText-questions mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/itext-questions

Buy the iText book: http://www.itextpdf.com/book/
Check the site with examples before you ask questions: http://www.1t3xt.info/examples/
You can also search the keywords list: http://1t3xt.info/tutorials/keywords/
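
For anyone skimming the thread, here is a rough sketch of the encoding problem described in the quoted message. It is plain Java with no iText calls, and every name in it (GlyphCodeDemo, encode, decode) is made up for illustration. It assumes a PDF producer that hands out codes 1, 2, 3, ... in order of first glyph use, as the quoted message describes for err.pdf. The reverse map stands in for the kind of code-to-character table a PDF can carry (for example a ToUnicode CMap). Without such a table, all an extractor has is the bare codes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: none of these names come from iText.
final class GlyphCodeDemo {

    // Mimics a producer that numbers glyphs 1, 2, 3, ... in order of first use.
    // Also fills codeToChar with the reverse mapping, which is the information
    // a code-to-character table inside the PDF would have to supply.
    static List<Integer> encode(String text, Map<Integer, Character> codeToChar) {
        Map<Character, Integer> charToCode = new LinkedHashMap<>();
        List<Integer> codes = new ArrayList<>();
        for (char c : text.toCharArray()) {
            Integer code = charToCode.get(c);
            if (code == null) {
                code = charToCode.size() + 1; // next unused code
                charToCode.put(c, code);
                codeToChar.put(code, c);
            }
            codes.add(code);
        }
        return codes;
    }

    // Turns codes back into text; emits '?' for any code with no known character.
    static String decode(List<Integer> codes, Map<Integer, Character> codeToChar) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            Character c = codeToChar.get(code);
            sb.append(c == null ? '?' : c.charValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<Integer, Character> reverse = new LinkedHashMap<>();
        List<Integer> codes = encode("hello", reverse);
        System.out.println(codes);                                // the codes as stored
        System.out.println(decode(codes, reverse));               // mapping available
        System.out.println(decode(codes, new LinkedHashMap<>())); // mapping missing
    }
}
```

With the reverse map, decoding gets the text back. With an empty map there is nothing to go on except the glyph outlines, and that is where OCR comes in.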
