Paul Durrant wrote:
> I'm trying to use
> iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, 1);
> on the attached PDF, but I don't get the text back. If I take the byte
> array and look at the contents, the text block is not in ASCII form,
> although all the coordinate structure is correct; e.g. anything between
> the () is not in ASCII form. How is it possible to get the text from
> this PDF?

Open the document in a PDF viewer and go to File > Document Properties > Fonts. You'll see that a font TTE... was used, with "Built-in" as its encoding. Read chapters 11 and 15 of the second edition of "iText in Action" and you should understand why this is a case where it's extremely difficult to extract the text.

In any case: this is NOT a bug in iText. This is a nice example of a PDF that can't be parsed with iText.

The encoding of a simple font is a sort of table in which a maximum of 256 character codes are mapped to 256 glyphs. For standard encodings, the character 'a' corresponds with a glyph a, /a/ or *a*. But anyone can use any other encoding, where the character 'a' corresponds with the glyph 'b', the character 'z' corresponds with the glyph 'a', etc.

That's why you get stuff like this when you parse your file:

!" !" &$ ’ () (") #$ $% * + !"#$%&" ’())$ ’"* ++ + !","-!"’) ’ (.+ (’"’!&/ )(++$00() .+)$ ’(!"1 2 ’ (34 $). , -- % ! -

These characters correspond with glyphs, but '!' doesn't correspond with the glyph for '!'.