(null) printed when characters cannot be decoded during text extraction
-----------------------------------------------------------------------

                 Key: PDFBOX-373
                 URL: https://issues.apache.org/jira/browse/PDFBOX-373
             Project: PDFBox
          Issue Type: Bug
          Components: Parsing
    Affects Versions: 0.8.0-incubator
            Reporter: Brian Carrier
             Fix For: 0.8.0-incubator


We have some PDF files in which the TO_UNICODE map is corrupt, so PDFBox 
cannot extract the text.  In that case font.encode() returns null, and 
PDFStreamEngine.showString() appends the null to the result, which is then 
printed as "(null)". 

Here is a patch (against the trunk) that replaces the null with "?".  

--- PDFStreamEngine.java        2008-09-17 16:09:13.529318500 -0400
+++ PDFStreamEngine-new.java    2008-09-17 16:12:51.617318500 -0400
@@ -422,6 +422,11 @@
                 }
             }
 
+            // Replace a null entry with "?" so it is not printed as "(null)"
+            if (c == null)
+            {
+                c = "?";
+            }
             totalStringWidth += width;
             stringResult.append( c );
         }
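
For readers without the PDFBox source at hand, the effect of the null check
can be sketched outside PDFBox. This is only an illustration under stated
assumptions: the class NullSafeAppend and the method joinDecoded are
hypothetical names, not PDFBox API, and the list of strings stands in for the
per-character results of font.encode(). It shows the standard Java behaviour
that appending a null String adds the literal text "null", and how the "?"
substitution avoids it; it does not reproduce the exact "(null)" output seen
in the report.

```java
import java.util.Arrays;
import java.util.List;

public class NullSafeAppend {

    // Hypothetical helper, not part of PDFBox: concatenates decoded
    // characters, substituting "?" for any entry that could not be
    // decoded (represented here by null), as the patch does.
    static String joinDecoded(List<String> decoded) {
        StringBuilder result = new StringBuilder();
        for (String c : decoded) {
            if (c == null) {
                c = "?";
            }
            result.append(c);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Without the check, appending a null String adds the text "null".
        StringBuilder unguarded = new StringBuilder();
        unguarded.append("a").append((String) null).append("b");
        System.out.println(unguarded); // prints "anullb"

        // With the check, the undecodable character becomes "?".
        System.out.println(joinDecoded(Arrays.asList("a", null, "b"))); // prints "a?b"
    }
}
```

A single "?" per undecodable character keeps the extracted text the same
length in characters as the decoded string would have been, which is
presumably why the patch substitutes a placeholder rather than skipping the
entry.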


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
