Hi Declan,

There are a lot of unresolved issues without a sample document, which makes 
them difficult or nearly impossible to reproduce. Yours is one of them.
Please attach a sample file to https://issues.apache.org/jira/browse/PDFBOX-289 
if possible.
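
To find which of the failing files to attach (0.8% of roughly 80,000 is on the order of 640 documents), one option is to run the batch so that a failure in one document is recorded instead of aborting the job. The following is only a minimal sketch under stated assumptions: FailureCollector, Extractor and Result are hypothetical names that are not part of PDFBox, and the Extractor stands in for whatever PDFBox-based extraction code is actually in use.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FailureCollector {

    /** Hypothetical stand-in for the real PDFBox-based extraction step. */
    interface Extractor {
        String extract(String path) throws Exception;
    }

    /** Outcome of a batch run: extracted text per file, plus the files that failed. */
    static final class Result {
        final Map<String, String> texts = new LinkedHashMap<>();
        final Map<String, String> failures = new LinkedHashMap<>();
    }

    /** Runs the extractor over every path and keeps going when one document fails. */
    static Result run(List<String> paths, Extractor extractor) {
        Result result = new Result();
        for (String path : paths) {
            try {
                result.texts.put(path, extractor.extract(path));
            } catch (Exception e) {
                // Catches runtime exceptions such as NullPointerException as well,
                // and records the exception class name against the failing file.
                result.failures.put(path, e.getClass().getName());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Fake extractor for demonstration only: files starting with "bad" fail.
        Extractor fake = path -> {
            if (path.startsWith("bad")) {
                throw new NullPointerException();
            }
            return "text of " + path;
        };
        Result r = run(Arrays.asList("a.pdf", "bad.pdf", "c.pdf"), fake);
        System.out.println("extracted: " + r.texts.keySet());
        System.out.println("failed:    " + r.failures);
    }
}
```

The keys of the failures map are then the candidates for a sample file, and the remaining documents can still be indexed.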

Andreas

> Hello everyone,
> 
> I'm new, so please be gentle with me.
> 
> We are using PDFBox to extract text from a large amount of PDFs (approx. 
> 80,000) in preparation for indexing in Solr/Lucene.
> 
> To do this, we call 
> org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages() to iterate over 
> the pages and strip the contents with the PDFTextStripper one page at 
> a time.
> 
> The vast majority are fine, but approx. 0.8% fail with a 
> NullPointerException at 
> org.apache.pdfbox.pdmodel.PDPageNode.getCount(PDPageNode.java:102).
> 
> I'm currently working from the trunk after seeing a similar problem in 
> the archives 
> (<http://mail-archives.apache.org/mod_mbox/incubator-pdfbox-dev/200809.mbox/%3cof15421546.54f415dc-on862574ba.006a9e36-862574ba.006ad...@uscmail.uscourts.gov%3E>) 
> but unfortunately it hasn't solved the issue.
> 
> The stack trace is:
> 
> Caused by: java.lang.NullPointerException
>     at org.apache.pdfbox.pdmodel.PDPageNode.getCount(PDPageNode.java:102)
>     at org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages(PDDocument.java:754)
>     at com.semantico.depp.extractor.PDFBoxPdfExtractor.writeText(PDFBoxPdfExtractor.java:71)
>     at com.semantico.depp.extractor.PDFBoxPdfExtractor.extractText(PDFBoxPdfExtractor.java:56)
>     at com.semantico.depp.task.JobTask.doJob(JobTask.java:129)
> 
> Having delved into the code, I found that the "page" variable is null when:
> 
> page.getDictionaryObject( COSName.COUNT )).intValue()
> 
> is called in PDPageNode.getCount().
> 
> I understand that not all PDFs can be supported, and to be honest I 
> think 99.2% is amazing. I just thought I would post this in the hopes 
> that someone has come across it before.
> 
> Thanks for any help.
> 
> Regards,
> 
> Declan
> 
