[jira] [Commented] (PDFBOX-1716) PDDocument.getNumberOfPages() return 0 for certain PDF document

Thomas Chojecki (JIRA) Mon, 16 Sep 2013 00:13:09 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13768119#comment-13768119
 ]


Thomas Chojecki commented on PDFBOX-1716:
-----------------------------------------

I've tested it against the latest 1.8.3 snapshot.
The document is encrypted and uses an object stream for the pages. At the moment, 
object streams are only resolved when the document is decrypted. So if you decrypt 
the document first, it should work fine. 

So try something like this:
pdDoc = new PDDocument(cosDoc);
pdDoc.decrypt("");          // decrypt with an empty password
pdDoc.getNumberOfPages();   // call this only after decrypting

PDDocument also provides a method, isEncrypted(), which returns true while 
the document is still encrypted. After decryption it returns 
false.

This should also work for at least pdfbox 1.8.2.
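
The decrypt-first order described above can be sketched as a small guard. This is only an illustration and does not use PDFBox itself: PagedDocument is a hypothetical stand-in for the three PDDocument methods mentioned in this comment (isEncrypted(), decrypt(String), getNumberOfPages()), and FakeEncryptedDocument is a toy that reports 0 pages until it has been decrypted, mimicking the behaviour in this issue.

```java
// Hypothetical stand-in for the PDDocument methods named in the comment.
// It is not part of PDFBox; the real decrypt() declares checked exceptions.
interface PagedDocument {
    boolean isEncrypted();
    void decrypt(String password);
    int getNumberOfPages();
}

// Toy document mimicking the report: the pages sit in an object stream
// that is only resolved once the document has been decrypted.
final class FakeEncryptedDocument implements PagedDocument {
    private final int realPageCount;
    private boolean encrypted = true;

    FakeEncryptedDocument(int realPageCount) {
        this.realPageCount = realPageCount;
    }

    @Override
    public boolean isEncrypted() {
        return encrypted;
    }

    @Override
    public void decrypt(String password) {
        // The toy accepts only the empty password.
        if (!password.isEmpty()) {
            throw new IllegalArgumentException("wrong password");
        }
        encrypted = false;
    }

    @Override
    public int getNumberOfPages() {
        // While encrypted, the page tree is not visible yet.
        return encrypted ? 0 : realPageCount;
    }
}

final class PageCounter {
    private PageCounter() {}

    // Decrypt first if needed, then ask for the page count.
    static int countPages(PagedDocument doc, String password) {
        if (doc.isEncrypted()) {
            doc.decrypt(password);
        }
        return doc.getNumberOfPages();
    }
}

final class DecryptFirstDemo {
    public static void main(String[] args) {
        PagedDocument doc = new FakeEncryptedDocument(5);
        System.out.println(doc.getNumberOfPages());          // count before decrypting
        System.out.println(PageCounter.countPages(doc, "")); // count after decrypting
    }
}
```

With the real library, the same guard would call the PDDocument methods directly and handle the exceptions that decrypt() declares.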
                
> PDDocument.getNumberOfPages() return 0 for certain PDF document
> ---------------------------------------------------------------
>
>                 Key: PDFBOX-1716
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1716
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.8.2
>            Reporter: Tom
>             Fix For: 1.8.2
>
>
> A sample document
> (https://issues.apache.org/jira/secure/attachment/12430914/FormI-9-English.pdf)
> can be found at https://issues.apache.org/jira/browse/PDFBOX-578. It looks like 
> the NPE fix in that issue 
> (https://issues.apache.org/jira/browse/PDFBOX-578) is a workaround.
> When I try to extract the text content from FormI-9-English.pdf, 
> PDDocument.getNumberOfPages() returns 0, which makes extracting 
> the text content impossible:
> InputStream in = <PDF  InputStream>
> PDFParser parser = new PDFParser(in);
> PDFTextStripper pdfStripper = null;
> String parsedText = null;
> parser.parse();
> cosDoc = parser.getDocument();
> pdfStripper = new PDFTextStripper();
> pdDoc = new PDDocument(cosDoc);
>
> // pdDoc.getNumberOfPages() returns 0, which is incorrect
> for (int i = 1; i <= pdDoc.getNumberOfPages(); i++) {
> }
> Note:
> 1. This problem was found in the latest PDFBox version, 1.8.2.
> 2. I didn't know which component to file this defect under, so please assign 
> it to the correct component if needed. Thanks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
