Null pointer exception during text extraction
---------------------------------------------

Key: PDFBOX-350
URL: https://issues.apache.org/jira/browse/PDFBOX-350
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Reporter: Jukka Zitting

[Issue from SourceForge]
http://sourceforge.net/tracker/index.php?func=detail&aid=1934566&group_id=78314&atid=552832

Parsing the following document from the US government website

http://www.ssa.gov/multilanguage/Arabic/10101-AR.pdf

fails with this exception:

Exception in thread "main" java.lang.NullPointerException
    at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:360)
    at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
    at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
    at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
    at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
    at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
    at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
    at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
    at org.pdfbox.ExtractText.main(ExtractText.java:244)

This is caused by an unchecked reference, c.equals( " " ), in line 377 of PDFStreamEngine.java. Changing this line to

    if( (string[i] == 0x20) && c != null && c.equals( " " ) )

eliminates the null pointer dereference, but the output then contains many ugly embedded nulls, as this excerpt shows:

    يف اوشاع اذإ ماعطلا عباوطل نيلهؤملا بناجلأا ،يلي اميف �������null��� ماعطلا عباوط جمانرب دعاسي

In one case the word "null" is printed several dozen times.
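
The following is a minimal, self-contained sketch of why guarding only the equals() call may not be enough. It is not PDFBox code: the class NullSafeDecodeSketch, the lookup() helper, and the Map-based encoding are hypothetical stand-ins for the font lookup that yields c. It relies on one known Java behaviour, namely that appending a null String reference produces the literal text "null". Whether that is the exact mechanism behind the output above is an assumption; the report does not confirm it.

```java
import java.util.HashMap;
import java.util.Map;

public class NullSafeDecodeSketch {

    // Hypothetical stand-in for a font's code-to-text lookup.
    // Returns null when the character code has no mapping.
    static String lookup(Map<Integer, String> encoding, int code) {
        return encoding.get(code);
    }

    // Appends every lookup result as-is. A null result is appended
    // as the four characters "null", because that is how
    // StringBuilder.append(String) treats a null reference.
    static String decodeNaive(Map<Integer, String> encoding, byte[] string) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < string.length; i++) {
            String c = lookup(encoding, string[i] & 0xFF);
            out.append(c);
        }
        return out.toString();
    }

    // Skips unmapped codes entirely, so no "null" text reaches the output.
    // Dropping the character is an assumption made for this sketch; a real
    // fix might substitute a placeholder or log the unmapped code instead.
    static String decodeGuarded(Map<Integer, String> encoding, byte[] string) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < string.length; i++) {
            String c = lookup(encoding, string[i] & 0xFF);
            if (c == null) {
                continue;
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> encoding = new HashMap<>();
        encoding.put(0x41, "A");
        encoding.put(0x20, " ");
        encoding.put(0x42, "B");

        // 0x01 has no mapping in the encoding above.
        byte[] raw = {0x41, 0x01, 0x20, 0x42};

        System.out.println(decodeNaive(encoding, raw));   // prints "Anull B"
        System.out.println(decodeGuarded(encoding, raw)); // prints "A B"
    }
}
```

If something like this is what happens in showString, the null check would need to cover every later use of c, not only the comparison with " ".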