Michael,

Sorry for being so obtuse here, but either you missed my point or I'm not explaining myself well enough. What I don't understand is the correlation between the hex values and the Differences array. I don't see any way to map 0x1d to T, even though I know and understand that T = /g55 and that 55 + 29 = 84, which is T. Where can I find that 0x1d = /g55? If I knew that, I could extract the text properly, at least for this one PDF.
Thanks for your time,
Darren

On Monday, October 13, 2014 3:14 AM, mkl <m...@wir-sind-cool.org> wrote:

Darren,

FDnC Red wrote
> What I don't see is how do I know that the Text Stream
>
> [()40.2(\()18.3(")0()]TJ
>
> maps to the glyph names in the Differences array.

First of all that extract from the content stream is incomplete. The pairs of round brackets (i.e. string delimiters) contain the following bytes:

1. pair: 0x1D 0x18 - both missing in your excerpt
2. pair: 0x5C 0x28 - an escaped (by the backslash 0x5C) opening bracket
3. pair: 0x22 - a double quote
4. pair: 0x1C - missing in your excerpt

(When doing such an excerpt, always remember that you deal with arbitrary bytes here, not merely bytes properly mapping to characters in ASCII or Latin1 or Unicode! Your excerpt dropped all bytes in the control character range.)

So here you see the

> characters extracted as hex 0x1d 0x18 0x28 0x22 0x1c

The differences array

[2, /g51, /g85, /g82, /g77, /g72, /g70, /g87, /g48, /g68, /g81, /g88, /g79, /g76, /g71, /g74, /g54, /g83, /g73, /g86, /g38, /g56, /g43, /g36, /g44, /g3, /g9, /g40, /g55, /g53, /g50, /g49, /g15, /g47, /g24, /g19, /g90, /g75, /g25, /g37, /g20, /g28, /g23, /g21, /g11, /g12, /g16, /g27, /g22, /g45, /g41]

maps these bytes as follows, and in combination with an offset 29 you get:

0x1d  /g55  55+29=84  T
0x18  /g36  36+29=65  A
0x28  /g37  37+29=66  B
0x22  /g47  47+29=76  L
0x1c  /g40  40+29=69  E

This offset 29, while seeming arbitrary, can often be seen in the glyph indices in fonts.

Regards,
Michael

--
View this message in context: http://itext-general.2136553.n4.nabble.com/Extracting-Text-tp4660444p4660451.html
Sent from the iText - General mailing list archive at Nabble.com.
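As for where 0x1d = /g55 comes from: in a Differences array, a number sets the character code for the glyph name that follows it, and every further name takes the next code up. Starting at 2, the name at code 29 (0x1d) is therefore /g55. The lookup can be sketched in Python roughly as below. This is only an illustration, not iText code: the function names are made up for the example, the array is copied from the message above, and the offset 29 is the value observed for this one font and may well differ elsewhere.

```python
# Sketch: resolve string bytes through a /Differences array.
# All names here are hypothetical; this is not part of any iText API.

# Differences array from the message: a start code, then glyph names.
DIFFERENCES = [2] + (
    "g51 g85 g82 g77 g72 g70 g87 g48 g68 g81 g88 g79 g76 g71 g74 g54 g83 "
    "g73 g86 g38 g56 g43 g36 g44 g3 g9 g40 g55 g53 g50 g49 g15 g47 g24 g19 "
    "g90 g75 g25 g37 g20 g28 g23 g21 g11 g12 g16 g27 g22 g45 g41"
).split()

# Assumption: offset seen for this particular font, not a general rule.
GLYPH_INDEX_OFFSET = 29


def build_code_to_glyph(differences):
    """Expand a Differences array into {character code: glyph name}."""
    mapping = {}
    code = None
    for entry in differences:
        if isinstance(entry, int):
            code = entry  # a number (re)sets the running character code
        else:
            if code is None:
                raise ValueError("glyph name before any start code")
            mapping[code] = entry
            code += 1  # the next name gets the next code
    return mapping


def guess_char(glyph_name, offset=GLYPH_INDEX_OFFSET):
    """Heuristic: treat 'gNN' as a glyph index and add the font's offset."""
    return chr(int(glyph_name[1:]) + offset)


def decode(data, differences):
    """Map each byte to its glyph name, then to a guessed character."""
    table = build_code_to_glyph(differences)
    return "".join(guess_char(table[b]) for b in data)


# Bytes of the four strings after unescaping (0x5C 0x28 becomes 0x28).
string_bytes = bytes([0x1D, 0x18, 0x28, 0x22, 0x1C])
code_to_glyph = build_code_to_glyph(DIFFERENCES)

print(code_to_glyph[0x1D])                # g55
print(decode(string_bytes, DIFFERENCES))  # TABLE
```

The first step (code to glyph name) follows directly from the array. The second step (glyph name to character) is only a guess based on the offset, so it should be checked against each font before relying on it.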
_______________________________________________
iText-questions mailing list
iText-questions@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/itext-questions

iText(R) is a registered trademark of 1T3XT BVBA.
Many questions posted to this list can (and will) be answered with a reference to the iText book: http://www.itextpdf.com/book/
Please check the keywords list before you ask for examples: http://itextpdf.com/themes/keywords.php