[jira] Commented: (PDFBOX-789) Error by text extraction

Jukka Zitting (JIRA) Mon, 09 Aug 2010 06:04:47 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896528#action_12896528
 ]


Jukka Zitting commented on PDFBOX-789:
--------------------------------------

The problem seems to be related to the large COSStream on page 134. I can avoid 
the issue easily enough with the following patch, but it would be better to 
find the root cause instead of relying on a workaround like this.

Index: pdfbox/src/main/java/org/apache/pdfbox/cos/COSString.java
===================================================================
--- pdfbox/src/main/java/org/apache/pdfbox/cos/COSString.java   (revision 982911)
+++ pdfbox/src/main/java/org/apache/pdfbox/cos/COSString.java   (working copy)
@@ -191,7 +191,11 @@
             }
             catch( NumberFormatException e )
             {
-                throw new IOException( "Error: Expected hex number, actual='" + hexChars + "'" );
+                retval.append( '?' );
             }
         }
         return retval;
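
For readers without the PDFBox source at hand, the idea behind the workaround can be sketched as a standalone class. This is only an illustration, not the PDFBox code: the class name LenientHexDecoder and its decode method are hypothetical, it checks digits with Character.digit instead of catching NumberFormatException, and it assumes the usual PDF rule that an odd-length hex string is padded with a trailing 0.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper, not part of PDFBox: decodes a hex string two
// characters at a time and substitutes '?' for any pair that is not
// valid hex, instead of aborting the whole extraction.
class LenientHexDecoder {

    static byte[] decode(String hex) {
        StringBuilder buf = new StringBuilder(hex.trim());
        // Assumption: an odd number of digits means the last digit is 0.
        if (buf.length() % 2 == 1) {
            buf.append('0');
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < buf.length(); i += 2) {
            int hi = Character.digit(buf.charAt(i), 16);
            int lo = Character.digit(buf.charAt(i + 1), 16);
            if (hi < 0 || lo < 0) {
                // Same effect as the patch: emit a placeholder and continue.
                out.write('?');
            } else {
                out.write((hi << 4) | lo);
            }
        }
        return out.toByteArray();
    }
}
```

With this sketch, decode("48ZZ6F") yields the bytes for "H?o" where the strict version would have thrown. The trade-off is the same as in the patch: malformed input is silently masked, so the underlying parsing problem still needs to be found.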


> Error by text extraction
> ------------------------
>
>                 Key: PDFBOX-789
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-789
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.2.1
>         Environment: Windows XP
>            Reporter: Slavomir Varchula
>             Fix For: 1.3.0
>
>         Attachments: pdf_euba.pdf, Skuska.java
>
>
> Hello,
> I tried to extract text from a PDF and the extraction ended with an error. 
> Here are the PDF, the source file, and the stack trace.

