[ https://issues.apache.org/jira/browse/PDFBOX-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896528#action_12896528 ]
Jukka Zitting commented on PDFBOX-789:
--------------------------------------
The problem seems to be related to the large COSStream on page 134. I can avoid
the issue easily enough with the following patch, but it would be better to
find the root cause instead of relying on a workaround like this.
Index: pdfbox/src/main/java/org/apache/pdfbox/cos/COSString.java
===================================================================
--- pdfbox/src/main/java/org/apache/pdfbox/cos/COSString.java (revision 982911)
+++ pdfbox/src/main/java/org/apache/pdfbox/cos/COSString.java (working copy)
@@ -191,7 +191,11 @@
             }
             catch( NumberFormatException e )
             {
-                throw new IOException( "Error: Expected hex number, actual='" + hexChars + "'" );
+                retval.append( '?' );
             }
         }
         return retval;
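
To make the effect of the workaround concrete, here is a minimal standalone sketch of the same idea: decode a hex string two digits at a time and substitute '?' for any pair that does not parse, instead of aborting. The class name LenientHexDecoder, the decode method, and the lenient flag are hypothetical and exist only for illustration. This is not the actual COSString code, and the padding of an odd trailing digit with '0' is an assumption based on how PDF hex strings are defined.

```java
import java.io.IOException;

// Hypothetical illustration only; not part of PDFBox.
final class LenientHexDecoder {

    /**
     * Decodes pairs of hex digits into characters.
     * When lenient is true, a pair that cannot be parsed becomes '?';
     * when false, it raises an IOException as the unpatched code does.
     */
    static String decode(String hex, boolean lenient) throws IOException {
        StringBuilder retval = new StringBuilder();
        // Assumption: an odd trailing digit is treated as if followed by '0'.
        String padded = (hex.length() % 2 == 1) ? hex + "0" : hex;
        for (int i = 0; i < padded.length(); i += 2) {
            String hexChars = padded.substring(i, i + 2);
            try {
                retval.append((char) Integer.parseInt(hexChars, 16));
            } catch (NumberFormatException e) {
                if (!lenient) {
                    throw new IOException(
                        "Error: Expected hex number, actual='" + hexChars + "'");
                }
                // Workaround: keep going with a placeholder character.
                retval.append('?');
            }
        }
        return retval.toString();
    }
}
```

In lenient mode one malformed pair costs a single placeholder character rather than the whole extraction, which is also why this only hides the root cause: the '?' gives no hint of why the bytes on page 134 were not valid hex in the first place.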
> Error by text extraction
> ------------------------
>
> Key: PDFBOX-789
> URL: https://issues.apache.org/jira/browse/PDFBOX-789
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.2.1
> Environment: Windows XP
> Reporter: Slavomir Varchula
> Fix For: 1.3.0
>
> Attachments: pdf_euba.pdf, Skuska.java
>
>
> Hello,
> I tried to extract text from a PDF, and the extraction ended with an error.
> Here are the PDF, the source file, and the stack trace.