[
https://issues.apache.org/jira/browse/PDFBOX-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17359065#comment-17359065
]
Andreas Lehmkühler commented on PDFBOX-5207:
--------------------------------------------
AFAIK nested arrays are not allowed as an operand of the TJ operator. The PDF in
question has at least one malformed array (a nested array with an unbalanced
number of square brackets). Before PDFBOX-5190 such arrays were skipped; now
the parser reads as much as possible. Those nested arrays lead to an IOException
in
{{org.apache.pdfbox.contentstream.PDFStreamEngine.showTextStrings(COSArray)}}.
I'm thinking about skipping such nested arrays and continuing with the remaining
part. In the current case the rendering is improved!
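
The skip-and-continue idea could look roughly like the sketch below. It is only
an illustration and deliberately does not use PDFBox types: a plain List stands
in for the TJ operand array, Strings stand in for text strings, Numbers for
position adjustments, and a nested List for a malformed nested array. The class
and method names (TjOperandFilter, skipNestedArrays) are hypothetical and are
not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical stand-in for handling a TJ operand; this is NOT PDFBox's API.
 * Strings model text to show, Numbers model position adjustments, and a
 * nested List models a malformed nested array.
 */
final class TjOperandFilter {

    private TjOperandFilter() {
    }

    /**
     * Returns the strings and numbers of the operand in their original order.
     * Nested arrays and any other unexpected element are skipped instead of
     * aborting the whole operation with an exception.
     */
    static List<Object> skipNestedArrays(List<?> operand) {
        List<Object> kept = new ArrayList<>();
        for (Object element : operand) {
            if (element instanceof String || element instanceof Number) {
                kept.add(element);
            }
            // Anything else (a nested List, null, ...) is ignored so that the
            // remaining elements are still processed.
        }
        return kept;
    }
}
```

With an operand modelled as ["Hel", -120, ["x", 5], "lo"], the nested list is
dropped and the two strings and the adjustment are kept. Whether skipped
elements should also be logged is left open here.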
BTW: we should think about refactoring
{{org.apache.pdfbox.pdfparser.PDFStreamParser}}. It uses COS objects when
parsing a content stream. Although such content is very similar to COS objects,
it isn't the same thing. A refactoring should simplify the parsing and reduce
the resources used. But that is another story ...
> Page not rendered / extracted, Unknown type in array for TJ operation
> ---------------------------------------------------------------------
>
> Key: PDFBOX-5207
> URL: https://issues.apache.org/jira/browse/PDFBOX-5207
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 2.0.23
> Reporter: Tilman Hausherr
> Priority: Major
> Labels: regression
> Attachments: ContentStream.txt, evince-395-0.zip-0.pdf
>
>
> Worked in 2.0.23, no longer now. The weird thing is that the content stream
> (attached) is the same. It contains a "[" in an array at offset 4211.