Can we see the actual PDF?

--Mark Storer
  Senior Software Engineer
  Cardiff.com
 
import legalese.Disclaimer;
Disclaimer<Cardiff> DisCard = null;
 
Autonomy Corp., an HP Company
 

> -----Original Message-----
> From: Kevin Day [mailto:[email protected]] 
> Sent: Thursday, October 27, 2011 3:57 PM
> To: [email protected]
> Subject: [iText-questions] Content stream question
> 
> I have an existing PDF that I'm trying to parse text out of, 
> and am winding up with a null pointer exception when reading 
> an array in the content stream.
> 
> I have narrowed the problem down to a particular line in the 
> content stream (if I run this one line through 
> PdfContentParser.parse() it fails):
> 
> Here is the line (sorry this is so ugly - I'll describe the 
> exact location of the problem in a second):
> 
> [(*)-15(*)-15(&+,)(*)-15(-./0,123/45)(*)-15(/.)(*)-15(/2+,.)(*
)-15(346/.7823/4)(*)-15(9,4,.82,:)(*)-15(;<)(*)-15(&+,)(*)-15(!>
==>52.823/4)(*)-15(?,4,.82/.)(*)-20(.,98.:349)(*)-20(2+,)(*)-2
0(=3@,=3> +//:)(*)-20(/6)(*)-20(A8.3/>5)(*)-20(34A,527,42)(*)-20(/>)[(*)
> -15(*)-15(&+,)(*)-15(-./0,123/45)(*)-15(/.)(*)-15(/2+,.)(*)-15
(346/.7823/4)(*)-15(9,4,.82,:)(*)-15(;<)(*)-15(&+,)(*)-15(!>
==>52.823/4)(*)-15(?,4,.82/.)(*)-20(.,98.:349)(*)-20(2+,)(*)-2
0(=3@,=3> +//:)(*)-20(/6)(*)-20(A8.3/>5)(*)-20(34A,527,42)(*)-20(/>)(21/
> 7,5)(*)-20(8.,)(*)-20(+<-/2+,2318=)(*)-20(34*)]
> TJ
> 
> 
> The problem is that there appears to be an open bracket [ in 
> the middle of this line.  If you search for -20(/>)[(*)-15  
> the problem is that open bracket.  This makes the parser 
> think it's reading an array inside the array.  The ending ] 
> then closes the inner array, and the whole thing blows up.
> 
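
A minimal sketch of why the extra bracket trips the parser, assuming a much simplified scanner. The class and method names are hypothetical, not part of iText, and this is not how PdfContentParser is implemented. It skips literal strings and counts how many arrays are open; hex strings and comments are ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of iText.
class BracketDepthCheck {

    /**
     * Scans a content stream fragment, records the offset of every '[' that
     * opens while another array is still open, and returns the number of
     * arrays left unclosed at the end (0 means the brackets balance).
     * Simplification: only literal strings "(...)" are skipped; hex strings
     * and comments are not handled.
     */
    static int arrayDepthAfter(String content, List<Integer> nestedOpens) {
        int depth = 0;       // arrays currently open
        int parenDepth = 0;  // > 0 while inside a literal string
        for (int i = 0; i < content.length(); i++) {
            char c = content.charAt(i);
            if (parenDepth > 0) {
                if (c == '\\') {
                    i++;                 // skip the escaped character
                } else if (c == '(') {
                    parenDepth++;        // balanced parentheses may nest
                } else if (c == ')') {
                    parenDepth--;
                }
            } else if (c == '(') {
                parenDepth = 1;
            } else if (c == '[') {
                if (depth > 0) {
                    nestedOpens.add(i);  // '[' inside an open array
                }
                depth++;
            } else if (c == ']' && depth > 0) {
                depth--;
            }
        }
        return depth;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the failing line: two '[' but only one ']'.
        String line = "[(*)-15(&+,)(*)-20(/>)[(*)-15(&+,)(*)-20(34*)] TJ";
        List<Integer> nestedOpens = new ArrayList<>();
        int unclosed = arrayDepthAfter(line, nestedOpens);
        System.out.println("nested '[' at offsets: " + nestedOpens);
        System.out.println("arrays left open: " + unclosed);
    }
}
```

On a line shaped like the one quoted above, the scan ends with one array still open. That matches the symptom described: the single ] closes the inner array, and TJ arrives while the outer array is still being read.
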
> At first blush, this looks like it's just a bad PDF.  But the 
> trick is that Acrobat parses and renders this thing just fine.
> 
> So my question is:  Is it possible that the above is actually 
> valid per the PDF spec, and we are just missing something 
> with the tokeniser or parser? 
> It wouldn't seem like it would be valid.  But if that were the 
> case, you'd really think that Acrobat wouldn't be able to 
> parse it, either.
> 
> Are we missing something in our parser, or is Acrobat doing 
> some sort of intense logic to reconstruct the TJ operation if 
> the array doesn't terminate properly?  I've done some 
> thinking on this, and I see no reasonable strategy for 
> determining where in the content stream to insert an artificial ].
> 
> 
> --
> View this message in context: 
> http://itext-general.2136553.n4.nabble.com/Content-stream-question-tp3946312p3946312.html
> Sent from the iText - General mailing list archive at Nabble.com.
> 
> --------------------------------------------------------------
> ----------------
> The demand for IT networking professionals continues to grow, 
> and the demand for specialized networking skills is growing 
> even more rapidly.
> Take a complimentary Learning@Cisco Self-Assessment and learn 
> about Cisco certifications, training, and career opportunities. 
> http://p.sf.net/sfu/cisco-dev2dev
> _______________________________________________
> iText-questions mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/itext-questions
> 
> iText(R) is a registered trademark of 1T3XT BVBA.
> Many questions posted to this list can (and will) be answered 
> with a reference to the iText book: 
> http://www.itextpdf.com/book/ Please check the keywords list 
> before you ask for examples: http://itextpdf.com/themes/keywords.php
> 
> 
