I have an existing PDF that I'm trying to parse text out of, and I end up
with a NullPointerException when reading an array in the content stream.

I have narrowed the problem down to a particular line in the content stream.
If I run this one line through PdfContentParser.parse(), it fails.

Here is the line (sorry this is so ugly, I'll describe the exact location
of the problem in a second):

[(*)-15(*)-15(&+,)(*)-15(-./0,123/45)(*)-15(/.)(*)-15(/2+,.)(*)-15(346/.7823/4)(*)-15(9,4,.82,:)(*)-15(;<)(*)-15(&+,)(*)-15(!==>52.823/4)(*)-15(?,4,.82/.)(*)-20(.,98.:349)(*)-20(2+,)(*)-20(=3@,=3+//:)(*)-20(/6)(*)-20(A8.3/>5)(*)-20(34A,527,42)(*)-20(/>)[(*)-15(*)-15(&+,)(*)-15(-./0,123/45)(*)-15(/.)(*)-15(/2+,.)(*)-15(346/.7823/4)(*)-15(9,4,.82,:)(*)-15(;<)(*)-15(&+,)(*)-15(!==>52.823/4)(*)-15(?,4,.82/.)(*)-20(.,98.:349)(*)-20(2+,)(*)-20(=3@,=3+//:)(*)-20(/6)(*)-20(A8.3/>5)(*)-20(34A,527,42)(*)-20(/>)(21/7,5)(*)-20(8.,)(*)-20(+<-/2+,2318=)(*)-20(34*)]
TJ


The problem appears to be an open bracket [ in the middle of this line.
Search for -20(/>)[(*)-15 to find it.  That bracket makes the parser think
it is reading an array nested inside the array.  The ] at the end of the
line then closes only the inner array, the outer array is never terminated,
and the whole thing blows up.
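
To make the location easy to find without eyeballing the line, here is a
minimal sketch of a bracket scan.  It does not use iText at all; the class
and method names (TjBracketScan, findNestedOpenBrackets) are made up for
illustration.  It assumes the operand contains only literal strings, numbers
and brackets, as the line above does, so it ignores hex strings, comments
and dictionaries.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of iText.
class TjBracketScan {

    /**
     * Returns the offsets of every '[' that opens while another array is
     * still open. Brackets inside literal strings "(...)" are ignored.
     */
    static List<Integer> findNestedOpenBrackets(String content) {
        List<Integer> nested = new ArrayList<>();
        int arrayDepth = 0;   // how many '[' are currently open
        int stringDepth = 0;  // nesting level of '(' inside a literal string
        for (int i = 0; i < content.length(); i++) {
            char c = content.charAt(i);
            if (stringDepth > 0) {
                if (c == '\\') {
                    i++;                 // skip the escaped character
                } else if (c == '(') {
                    stringDepth++;       // balanced parens may nest in a string
                } else if (c == ')') {
                    stringDepth--;
                }
            } else if (c == '(') {
                stringDepth = 1;         // a literal string starts here
            } else if (c == '[') {
                if (arrayDepth > 0) {
                    nested.add(i);       // '[' seen while an array is open
                }
                arrayDepth++;
            } else if (c == ']' && arrayDepth > 0) {
                arrayDepth--;
            }
        }
        return nested;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the failing line, same shape as the original.
        String line = "[(*)-15(&+,)-20(/>)[(*)-15(&+,)-20(34*)]\nTJ";
        System.out.println(findNestedOpenBrackets(line));
    }
}
```

Run against the full line, a scan like this should report the offset of
the bracket in -20(/>)[(*)-15, since that is the only [ that appears while
the outer array is still open.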

At first blush, this looks like it's just a bad PDF.  But the trick is that
Acrobat parses and renders this thing just fine.

So my question is:  Is it possible that the above is actually valid per the
PDF spec, and we are just missing something in the tokeniser or parser?
It doesn't seem like it would be valid.  But if it were invalid, you'd
really think that Acrobat wouldn't be able to parse it, either.

Are we missing something in our parser, or is Acrobat doing some sort of
intense logic to reconstruct the TJ operation if the array doesn't terminate
properly?  I've done some thinking on this, and I see no reasonable strategy
for determining where in the content stream to insert an artificial ].

