Paul Durrant wrote:
> I'm trying to use 
>  iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, 1);
> 
> on the attached PDF but I don't get the text back. If I take the byte 
> array and look at the contents, then
> the text block is not in ASCII form although all the co-ordinate 
> structure is correct, e.g. anything between the () is not in ASCII form. 
> How is it possible to get the text from this PDF?

Open the document and look at File > Document Properties > Fonts.
You'll see that a font TTE... was used, with "Built-in" as its encoding.
Read chapters 11 and 15 of the second edition of "iText in Action"
and you should understand that this is an example where it's extremely
difficult to extract the text.

In any case: this is NOT a bug in iText.
This is a nice example of a PDF that can't be parsed with iText.

The encoding of a simple font is a sort of table in which a maximum
of 256 characters are mapped to 256 glyphs. For standard encodings
the character 'a' corresponds to a glyph a, /a/ or *a*.

But anyone can use any other encoding where the character 'a'
corresponds with the glyph 'b', the character 'z' corresponds with
the glyph 'a', etc...

That's why you get stuff like this when you parse your file:
!" !"
  &$ ’ () (")
#$ $%
* +
  !"#$%&" ’())$ ’"* ++ + !","-!"’) ’ (.+ (’"’!&/
)(++$00() .+)$ ’(!"1
2 ’ (34 $).
, -- % ! -

These characters correspond to glyphs, but '!' doesn't correspond
to the glyph for '!'.
