[ 
https://issues.apache.org/jira/browse/PDFBOX-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom updated PDFBOX-1716:
------------------------

    Description: 
Sample 
document(https://issues.apache.org/jira/secure/attachment/12430914/FormI-9-English.pdf)
 can be found here https://issues.apache.org/jira/browse/PDFBOX-578.  Looks the 
NPE issue fix in that work item 
https://issues.apache.org/jira/browse/PDFBOX-578 is a work around.
When I try to extract the text content from /FormI-9-English.pdf , when I call 
PDDocument.getNumberOfPages(), this method return 0 which makes the extraction 
of the text content impossible:

InputStream in = <PDF  InputStream>
PDFParser parser = new PDFParser(content);
                                PDFTextStripper pdfStripper = null;
                                String parsedText = null;
                                parser.parse();
                                cosDoc = parser.getDocument();
                                pdfStripper = new PDFTextStripper();
                                pdDoc = new PDDocument(cosDoc);
                                
                                for(int i=1; i<= pdDoc.getNumberOfPages(); i++) 
{ // !!pdDoc.getNumberOfPages() return 0, which is incorrect!!
                                
                                }

Note:
1. This problem is found in the PDFBox latest version 1.8.2
2. I didn't which component to file this defect, so please assign to the 
correct component if needed, Thanks






  was:
Sample 
document(https://issues.apache.org/jira/secure/attachment/12430914/FormI-9-English.pdf)
 can be found here https://issues.apache.org/jira/browse/PDFBOX-578.  Looks the 
NPE issue fix in that work item 
https://issues.apache.org/jira/browse/PDFBOX-578 is a work around.
When I try to extract the text content from /FormI-9-English.pdf , when I call 
PDDocument.getNumberOfPages(), this method return 0 which makes the extraction 
of the text content impossible:

InputStream in = <PDF  InputStream>
PDFParser parser = new PDFParser(content);
                                PDFTextStripper pdfStripper = null;
                                String parsedText = null;
                                parser.parse();
                                cosDoc = parser.getDocument();
                                pdfStripper = new PDFTextStripper();
                                pdDoc = new PDDocument(cosDoc);
                                
                                for(int i=1; i<= pdDoc.getNumberOfPages(); i++) 
{ // !!pdDoc.getNumberOfPages() return 0, which is incorrect!!
                                
                                }

This problem is found in the PDFBox latest version 1.8.2






    
> PDDocument.getNumberOfPages() return 0 for certain PDF document
> ---------------------------------------------------------------
>
>                 Key: PDFBOX-1716
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1716
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>            Reporter: Tom
>             Fix For: 1.8.2
>
>
> Sample 
> document(https://issues.apache.org/jira/secure/attachment/12430914/FormI-9-English.pdf)
>  can be found here https://issues.apache.org/jira/browse/PDFBOX-578.  Looks 
> the NPE issue fix in that work item 
> https://issues.apache.org/jira/browse/PDFBOX-578 is a work around.
> When I try to extract the text content from /FormI-9-English.pdf , when I 
> call PDDocument.getNumberOfPages(), this method return 0 which makes the 
> extraction of the text content impossible:
> InputStream in = <PDF  InputStream>
> PDFParser parser = new PDFParser(content);
>                               PDFTextStripper pdfStripper = null;
>                               String parsedText = null;
>                               parser.parse();
>                               cosDoc = parser.getDocument();
>                               pdfStripper = new PDFTextStripper();
>                               pdDoc = new PDDocument(cosDoc);
>                               
>                               for(int i=1; i<= pdDoc.getNumberOfPages(); i++) 
> { // !!pdDoc.getNumberOfPages() return 0, which is incorrect!!
>                                 
>                                 }
> Note:
> 1. This problem is found in the PDFBox latest version 1.8.2
> 2. I didn't which component to file this defect, so please assign to the 
> correct component if needed, Thanks

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to